Although looking for the right home largely comes down to personal preference, two coastal US cities - New York City and San Francisco - regularly top home-buyers' lists. Both offer a blend of elegant sophistication and wild spectacle: New York has stunning skyline views, cuisines from around the world, and fabulous Broadway shows, while living in San Francisco means sunshine and relaxation. The reality, however, is that housing in both cities is expensive and competitive, which led us to seek insight into housing prices before settling down. In this project, we give insights into real estate prices in New York and San Francisco, and we compare the important factors that influence house prices in the two cities.
One of our team members is planning to move to San Francisco or New York after graduation, so she would like to compare housing prices in the two cities, gain insight into prices based on specific factors, and estimate the price of a house or condo from those variables. We are also inspired by the material covered in this course, so we chose to work with complex, real-world real estate data and build reliable models on those data to produce meaningful results.
The goal of this project is to apply exploratory data analysis and predictive modeling to understand real estate prices in New York and San Francisco from both buyers' and sellers' perspectives. For our project topic, we propose several research questions: What are the significant factors that influence home prices in each city? Should I purchase a currently listed house now? How much will a property cost if I have specific tastes in houses?
All the data we collected and built models upon come from the Zillow API Network, including the GetDeepSearch-Results API and the GetUpdatedPropertyDetails API.
We first scraped property addresses and zip codes by looping through New York's and San Francisco's Zillow listings and recently sold properties (the scraping R code is too long to include here; see nyaddress.R and sfaddress.R on GitHub). We then used the Python pyZillow package (see the link to GitHub below) to get the API results by querying the addresses and zip codes obtained in the previous step. In addition, we scraped the listing price and url of each currently listed property by looping through its zpid (the id used by Zillow).
# Loading API data
library(readr)
sf_data <- read.csv("data/sf_feature.csv")
ny_data <- read.csv("data/ny_feature.csv")
# Do the basic clean of API data
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
colnames(sf_data)
## [1] "X" "zillow_id" "latitude"
## [4] "longitude" "year_built" "bathrooms"
## [7] "bedrooms" "last_sold_date" "last_sold_price"
## [10] "property.size" "home_type" "tax_year"
## [13] "tax_value" "home_size" "zipcode"
sf_data <- sf_data %>%
select(-X)
ny_data <- ny_data %>%
select(-X)
Since scraping the listing prices and urls requires last_sold_date in a particular format, we first wrangle last_sold_date so that the listing prices and urls can be combined into the dataset.
# Convert the format of last_sold_date
library(lubridate)
##
## Attaching package: 'lubridate'
## The following object is masked from 'package:base':
##
## date
sf_data <- sf_data %>%
mutate(last_sold_date = mdy(last_sold_date))
ny_data <- ny_data %>%
mutate(last_sold_date = mdy(last_sold_date))
Since scraping the current listing prices loops through all the urls every time the code runs, we write the scraped data to a csv file once and read from that file thereafter.
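This cache-or-scrape pattern can be sketched as follows. Here `scrape_listings()` is a hypothetical stub standing in for the slow url-scraping loop shown below, and a temporary path stands in for data/sflisting.csv:

```r
# Minimal sketch of the caching pattern: scrape once, reuse the csv afterwards.
# scrape_listings() is a hypothetical stub for the slow url-scraping loop.
scrape_listings <- function() {
  data.frame(zillow_id = c(15063505, 15063651), list = NA, url = NA)
}

cache_file <- file.path(tempdir(), "sflisting.csv")  # report uses data/sflisting.csv
if (file.exists(cache_file)) {
  sf_cached <- read.csv(cache_file)          # fast path: read the cache
} else {
  sf_cached <- scrape_listings()             # slow path: scrape, then cache
  write.csv(sf_cached, cache_file, row.names = FALSE)
}
```

On the second and later runs, the `file.exists()` branch skips the scraping entirely.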
# Find the listing prices of current listing properties in San Francisco and New York
library(stringr)
library(rvest)
library(tidyr)
# Find the current listing, not recently sold properties
z_id <- sf_data %>%
  filter(last_sold_date < as.Date("2013-04-01") | is.na(last_sold_date)) %>%
  select(zillow_id)
z_idny <- ny_data %>%
  filter(last_sold_date < as.Date("2013-04-01") | is.na(last_sold_date)) %>%
  select(zillow_id)
zpid <- as.character(z_id$zillow_id)
zpid_ny <- as.character(z_idny$zillow_id)
url_base <- "http://www.zillow.com/homes/for_sale/"
urls <- unique(paste0(url_base, zpid, "_zpid"))
urlsny <- unique(paste0(url_base, zpid_ny, "_zpid"))
# Get the listing information for a specific listing property by its url
findlisting <- function(url) {
  list <- read_html(url) %>%
    html_nodes(".main-row span") %>%
    html_text()
  # Return NULL on pages that fail to parse so bind_rows() silently skips them
  tryCatch(data_frame(list, url), error = function(e) NULL)
}
# Get the zpid which matches those the original data frame, and url for later use
listing <-bind_rows(lapply(urls, findlisting))
listing <- listing %>% filter(list!="") %>% mutate(zillow_id=url)
listing$zillow_id<-as.numeric(gsub("_zpid","",(gsub("http://www.zillow.com/homes/for_sale/","",listing$url))))
listing <- listing %>% filter(grepl("^\\$",list))
sf_data <- left_join(sf_data,listing,by="zillow_id")
# Do the same thing for NY
listingny <-bind_rows(lapply(urlsny, findlisting))
listingny <- listingny %>% filter(list!="") %>% mutate(zillow_id=url)
listingny$zillow_id<-as.numeric(gsub("_zpid","",(gsub("http://www.zillow.com/homes/for_sale/","",listingny$url))))
listingny <- listingny %>% filter(grepl("^\\$",list))
ny_data <- left_join(ny_data,listingny,by="zillow_id")
# Output the data to a csv file with original information from API, listing price and urls
write.csv(sf_data, file = "data/sflisting.csv",row.names=FALSE)
write.csv(ny_data, file = "data/nylisting.csv",row.names=FALSE)
In this part, we conduct exploratory data analysis and data cleaning to prepare for the model building stage. We prepare two versions of the data set: in the first, all observations with NA values are removed; in the second, observations with NA values are retained and each missing value is imputed with the median of that variable within groups defined by the values of the observation's other variables. We keep both versions because removing every observation with a missing value leaves the data sets for the two cities much smaller than the originals. Later, we build models on both versions and see which gives better results.
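The group-based median imputation idea can be sketched on toy data as below. Grouping by `bedrooms` alone is purely illustrative here; the actual grouping variables are defined in the imputation section later:

```r
library(dplyr)

# Toy data: impute missing home_size with the median within each bedrooms group
toy <- data.frame(bedrooms  = c(1, 1, 1, 2, 2, 2),
                  home_size = c(500, 700, NA, 1000, NA, 1200))

toy_imputed <- toy %>%
  group_by(bedrooms) %>%
  mutate(home_size = ifelse(is.na(home_size),
                            median(home_size, na.rm = TRUE),  # group median
                            home_size)) %>%
  ungroup()
```

The NA in the one-bedroom group is filled with median(500, 700) = 600, and the NA in the two-bedroom group with median(1000, 1200) = 1100.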
First, we remove NA values and obvious outliers (judged by common sense) from the data set, since these would hurt performance during the model building stage. The outlier criteria appear in the filters in the code below. Note that last_sold_price will be the outcome in our model, so we remove all observations with NA values in last_sold_price.
sf_data <- read_csv("data/sflisting.csv")
ny_data <- read_csv("data/nylisting.csv")
sf_data <- sf_data %>%
mutate(property_size = property.size) %>%
select(-property.size) %>%
filter(!is.na(zipcode)) %>%
filter(bedrooms != 999 | is.na(bedrooms)) %>%
filter(!is.na(last_sold_price) & last_sold_price <= 3.5e7 & last_sold_price >= 1e5) %>%
filter(home_size > 20 & home_size < 10^5) %>%
distinct()
ny_data <- ny_data %>%
mutate(property_size = property.size) %>%
select(-property.size) %>%
filter(bedrooms != 255 | is.na(bedrooms)) %>%
filter(!is.na(last_sold_price) & last_sold_price <= 5.8e7 & last_sold_price >= 1e5) %>%
filter(home_size > 20 & home_size < 10^5) %>%
distinct()
Since the data is collected from Zillow, there is a lot of missingness in the data set. We now assess the missingness of our data to get a general idea of its completeness.
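A quick per-variable NA count can also be obtained directly in base R, shown here on a toy data frame; the analysis below uses ForImp's missingness() for the full summary:

```r
# Count and percentage of missing values per column, on a toy data frame
toy_na <- data.frame(bedrooms  = c(2, NA, 3, NA),
                     tax_value = c(1e6, 2e6, NA, 4e6))

na_count <- colSums(is.na(toy_na))          # NAs per variable
na_pct   <- 100 * na_count / nrow(toy_na)   # as a percentage of rows
```

Here `na_count` is 2 for bedrooms and 1 for tax_value, i.e. 50% and 25% missingness.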
#install.packages("ForImp")
library(ForImp)
library(knitr)
library(scales)
# Compute the number of missing values in each variable
miss <- data.frame(missingness(as.matrix(sf_data))$missing_values_per_variable)
# Reorganize the data frame
rownum <-row.names(miss)
percentage <- percent(miss[,1]/nrow(sf_data))
miss <- miss %>% mutate(percentage = percentage)
miss <- miss[,c(2,1)]
colnames(miss) <- c("percentage","number of missingness")
row.names(miss) <- rownum
miss %>% kable(caption="Missingness of variables in SF data")
| | percentage | number of missingness |
|---|---|---|
| zillow_id | 0.0% | 0 |
| latitude | 0.0% | 0 |
| longitude | 0.0% | 0 |
| year_built | 1.1% | 149 |
| bathrooms | 3.0% | 417 |
| bedrooms | 10.8% | 1509 |
| last_sold_date | 0.0% | 0 |
| last_sold_price | 0.0% | 0 |
| home_type | 0.0% | 0 |
| tax_year | 2.6% | 362 |
| tax_value | 2.6% | 362 |
| home_size | 0.0% | 0 |
| zipcode | 0.0% | 0 |
| list | 98.3% | 13704 |
| url | 98.3% | 13704 |
| property_size | 40.0% | 5577 |
# Do the same thing for NY
miss_ny <- data.frame(missingness(as.matrix(ny_data))$missing_values_per_variable)
rownum_ny <-row.names(miss_ny)
percentage_ny <- percent(miss_ny[,1]/nrow(ny_data))
miss_ny <- miss_ny %>% mutate(percentage = percentage_ny)
miss_ny <- miss_ny[,c(2,1)]
colnames(miss_ny) <- c("percentage","number of missingness")
row.names(miss_ny) <- rownum_ny
miss_ny %>% kable(caption="Missingness of variables in NY data")
| | percentage | number of missingness |
|---|---|---|
| zillow_id | 0.0% | 0 |
| latitude | 0.0% | 0 |
| longitude | 0.0% | 0 |
| year_built | 4.2% | 295 |
| bathrooms | 28.4% | 2010 |
| bedrooms | 26.1% | 1844 |
| last_sold_date | 0.0% | 0 |
| last_sold_price | 0.0% | 0 |
| home_type | 0.0% | 0 |
| tax_year | 6.6% | 469 |
| tax_value | 6.6% | 469 |
| home_size | 0.0% | 0 |
| zipcode | 0.0% | 0 |
| list | 90.4% | 6389 |
| url | 90.4% | 6389 |
| property_size | 21.5% | 1519 |
We then conduct exploratory data analysis on the variables in the data set to see whether there still exist outliers that may become influential points in the model building stage, and whether we need to transform any variables to produce better predictions.
# Check which variables are potential variables in the model
# and require transformation
colnames(sf_data)
## [1] "zillow_id" "latitude" "longitude"
## [4] "year_built" "bathrooms" "bedrooms"
## [7] "last_sold_date" "last_sold_price" "home_type"
## [10] "tax_year" "tax_value" "home_size"
## [13] "zipcode" "list" "url"
## [16] "property_size"
# year_built, bathrooms, bedrooms, last_sold_price, property_size, home_type, tax_year, tax_value, home_size, zipcode
par(mfrow = c(2, 4), oma=c(0,0,2,0))
vars <- c(4:6, 8, 10:12, 16)
for (i in vars){
hist(as.matrix(sf_data[,i]), main = colnames(sf_data)[i])
}
title("Explorative graphs for variables in San Francisco Data", outer =TRUE)
table(sf_data$home_type)
##
## Apartment Condominium Cooperative
## 82 5829 34
## Duplex Miscellaneous Mobile
## 515 47 2
## MultiFamily2To4 MultiFamily5Plus SingleFamily
## 870 14 6439
## Townhouse Unknown VacantResidentialLand
## 94 4 15
table(sf_data$zipcode)
##
## 94102 94103 94104 94105 94107 94108 94109 94110 94111 94112 94114 94115
## 342 501 45 555 1214 139 799 1102 108 950 782 557
## 94116 94117 94118 94121 94122 94123 94124 94127 94131 94132 94133 94134
## 793 620 531 589 750 533 525 536 758 352 229 482
## 94158
## 153
dev.off()
## null device
## 1
Based on the results above, we decide which variables need transformation and what range of each variable to use in our model; we then filter out observations whose values fall outside those ranges. The range criteria are stated in the comments interleaved with the cleaning code below:
library(ggplot2)
# Remove bathrooms >=9.5 and NA
table(sf_data$bathrooms)
##
## 0.5 1 1.1 1.25 1.3 1.5 1.75 1.8 2 2.1 2.2 2.25 2.3 2.5 2.75
## 1 4877 6 22 3 483 7 3 4940 1 2 22 1 550 6
## 3 3.25 3.3 3.5 3.75 4 4.4 4.5 5 5.1 5.25 5.5 5.75 6 6.5
## 1416 6 2 301 7 441 1 110 167 1 1 25 1 64 15
## 7 7.5 8 8.5 9 10 11 12 12.5 17 27
## 20 6 8 1 4 1 1 1 2 1 1
sf_data <- sf_data %>%
filter(bathrooms < 9.5)
# Remove bedrooms >=16 and NA
table(sf_data$bedrooms)
##
## 0 1 2 3 4 5 6 7 8 9 10 11 12 16 20
## 199 2111 4368 3220 1454 487 198 68 34 30 5 5 2 1 1
sf_data <- sf_data %>%
filter(bedrooms < 16)
# Transform last_sold_price and remove price = 1
ggplot(sf_data, aes(x = last_sold_price)) + geom_histogram(col="purple4",fill="white")+ggtitle("histogram of log last sold price in SF")+scale_x_log10()+xlab("log last sold price")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
sf_data <- sf_data %>%
filter(last_sold_price > 1) %>%
mutate(log_last_sold_price = log(last_sold_price))
# Filter out NA tax_value
sf_data <- sf_data %>%
filter(!is.na(tax_value))
ggplot(sf_data, aes(x = tax_value)) + geom_histogram(col="olivedrab4",fill="white")+ggtitle("histogram of log tax value in SF")+scale_x_log10()+xlab("log tax value")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
sf_data <- sf_data %>%
mutate(log_tax_value = log(tax_value))
# Remove NA home_size and transform home_size and property_size
ggplot(sf_data, aes(x = home_size)) + geom_histogram(col="blue",fill="white")+ggtitle("histogram of log home size in SF")+scale_x_log10()+xlab("log home size")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
sf_data <- sf_data %>%
filter(home_size > 1) %>%
mutate(log_home_size = log(home_size),
log_property_size = log(property_size))
# Do the same for ny_data
par(mfrow = c(2, 4), oma=c(0,0,2,0))
vars <- c(4:6, 8, 10:12, 16)
for (i in vars){
hist(as.matrix(ny_data[,i]), main = colnames(ny_data)[i])
}
title("Explorative graphs for variables in New York Data", outer =TRUE)
table(ny_data$home_type)
##
## Apartment Condominium Cooperative
## 125 1410 388
## Duplex Miscellaneous Mobile
## 538 27 4
## MultiFamily2To4 MultiFamily5Plus Quadruplex
## 1527 17 45
## SingleFamily Townhouse Triplex
## 2576 246 143
## Unknown VacantResidentialLand
## 1 19
table(ny_data$zipcode)
##
## 10001 10002 10003 10004 10005 10006 10007 10009 10010 10011 10012 10013
## 42 31 33 40 52 21 39 28 17 34 39 28
## 10014 10016 10017 10018 10019 10021 10022 10023 10024 10025 10026 10027
## 46 39 34 15 33 32 21 34 39 31 32 41
## 10028 10029 10030 10031 10032 10033 10034 10035 10036 10037 10038 10039
## 32 27 20 39 26 22 14 27 36 17 57 13
## 10040 10044 10065 10069 10075 10128 10280 10282 10301 10302 10303 10304
## 15 6 24 52 17 21 62 5 47 58 64 80
## 10305 10306 10307 10308 10309 10310 10312 10314 10451 10452 10453 10454
## 60 65 47 61 48 54 51 61 5 11 11 6
## 10455 10456 10457 10458 10459 10460 10461 10462 10463 10464 10465 10466
## 9 27 23 17 18 27 65 24 20 15 69 51
## 10467 10468 10469 10470 10471 10472 10473 10474 10475 11001 11004 11005
## 30 19 50 18 20 41 50 6 6 57 33 9
## 11040 11096 11101 11102 11103 11104 11105 11106 11109 11201 11203 11204
## 53 18 53 37 27 29 62 26 15 36 53 53
## 11205 11206 11207 11208 11209 11210 11211 11212 11213 11214 11215 11216
## 56 65 48 40 43 51 53 51 51 49 53 48
## 11217 11218 11219 11220 11221 11222 11223 11224 11225 11226 11228 11229
## 50 59 49 55 46 42 70 26 35 42 49 39
## 11230 11231 11232 11233 11234 11235 11236 11237 11238 11239 11354 11355
## 64 59 26 47 33 21 43 33 54 2 59 62
## 11356 11357 11358 11360 11361 11362 11363 11364 11365 11366 11367 11368
## 70 40 56 43 59 50 22 54 55 44 36 47
## 11369 11372 11373 11374 11375 11377 11378 11379 11385 11411 11412 11413
## 63 25 51 34 12 56 53 48 42 48 38 34
## 11414 11415 11416 11417 11418 11419 11420 11421 11422 11423 11426 11427
## 54 21 40 79 55 46 50 57 49 34 64 33
## 11428 11429 11432 11433 11434 11435 11436 11581 11691 11692 11694
## 41 59 44 71 46 37 61 1 55 39 49
dev.off()
## null device
## 1
# Remove bathrooms >= 14 and NA
table(ny_data$bathrooms)
##
## 0.5 1 1.1 1.5 1.75 2 2.25 2.5 2.75 3 3.1 3.5 3.75 4 4.5
## 5 1420 1 265 6 1557 1 312 4 907 1 96 3 265 25
## 5 5.5 6 6.5 7 7.5 8 8.5 9 9.5 10 16
## 111 7 39 6 8 1 7 1 2 2 3 1
ny_data <- ny_data %>%
filter(bathrooms < 14)
# Remove bedrooms >= 21 and NA
table(ny_data$bedrooms)
##
## 0 1 2 3 4 5 6 7 8 9 10 11 12 13 15
## 202 754 910 1349 748 462 301 141 68 37 11 14 2 1 1
## 17
## 2
ny_data <- ny_data %>%
filter(bedrooms < 21)
# Transform last_sold_price and remove last_sold_price = 1
ggplot(ny_data, aes(x = last_sold_price)) + geom_histogram(col="darkred",fill="white")+ggtitle("histogram of log last sold price in NYC")+scale_x_log10()+xlab("log last sold price")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
ny_data <- ny_data %>%
filter(last_sold_price > 1) %>%
mutate(log_last_sold_price = log(last_sold_price))
# Filter NA tax_value
ny_data <- ny_data %>%
filter(!is.na(tax_value))
ggplot(ny_data, aes(x = tax_value)) + geom_histogram(col="indianred2",fill="white")+ggtitle("histogram of log tax value in NYC")+scale_x_log10()+xlab("log tax value")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
ny_data <- ny_data %>%
mutate(log_tax_value = log(tax_value))
# Remove NA home_size and transform home_size and property_size
ggplot(ny_data, aes(x = home_size)) + geom_histogram(col="orange1",fill="white")+ggtitle("histogram of log home size in NYC")+scale_x_log10()+xlab("log home size")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
ny_data <- ny_data %>%
filter(home_size > 1) %>%
mutate(log_home_size = log(home_size),
log_property_size = log(property_size))
Finally, we remove all observations with NA values in the remaining predictor variables.
# Remove all the na values in the predictors
sf_data <- sf_data %>%
filter(!is.na(year_built) &
!is.na(last_sold_date) &
!is.na(home_type) &
!is.na(tax_year))
ny_data <- ny_data %>%
filter(!is.na(year_built) &
!is.na(last_sold_date) &
!is.na(home_type) &
!is.na(tax_year))
We compare the distributions of the variables before and after transformation in the two cities to see whether the transformation resolves the problem of influential points in the data set.
library(gridExtra)
library(grid)
# Comparing the before and after transforming tax_value
tax_value_notran <- ggplot(sf_data, aes(x = tax_value)) + geom_histogram(colour = "darkgreen", fill = "white")+ggtitle("Before transformation")
tax_value_tran <- ggplot(sf_data, aes(x = tax_value)) + geom_histogram(colour = "white", fill = "darkgreen")+ggtitle("After log transformation")+scale_x_log10()
# Comparing the before and after transforming home_size
home_value_notran <- ggplot(sf_data, aes(x = home_size)) + geom_histogram(colour = "darkblue", fill = "white")+ggtitle("Before transformation")
home_value_tran <- ggplot(sf_data, aes(x = home_size)) + geom_histogram(colour = "white", fill = "darkblue")+ggtitle("After log transformation")+scale_x_log10()
grid.arrange(tax_value_notran, tax_value_tran, home_value_notran,home_value_tran, top = "Comparing Before and After Transformation for Variable in SF data")
# Comparing the before and after transforming tax_value
tax_value_notran <- ggplot(ny_data, aes(x = tax_value)) +
geom_histogram(colour = "red", fill = "white") +
ggtitle("Before transformation")
tax_value_tran <- ggplot(ny_data, aes(x = tax_value)) +
geom_histogram(colour = "white", fill = "red") +
ggtitle("After log transformation") +
scale_x_log10()
# Comparing the before and after transforming home_size
home_value_notran <- ggplot(ny_data, aes(x = home_size)) + geom_histogram(colour = "darkorange", fill = "white")+ggtitle("Before transformation")
home_value_tran <- ggplot(ny_data, aes(x = home_size)) + geom_histogram(colour = "white", fill = "darkorange")+ggtitle("After log transformation")+scale_x_log10()
grid.arrange(tax_value_notran, tax_value_tran, home_value_notran,home_value_tran, top = "Comparing Before and After Transformation for Variable in NY data")
Since longitude and latitude cannot be used directly in the models (the magnitudes of those values are not meaningful as predictors), we run k-means on these two variables to cluster the houses in each city by geographic location.
To find a good k for k-means, we look for the k beyond which the within-cluster sum of squares stops decreasing substantially. Here we try different values of k in the two cities and pick the smallest k that keeps the within-cluster sum of squares on a reasonable scale.
# SF
sf_location <- cbind(sf_data$latitude, sf_data$longitude)
colnames(sf_location) <- c("latitude", "longitude")
mydata <- sf_location
wss <- (nrow(mydata)-1) * sum(apply(mydata,2,var))
for (i in 2:20) wss[i] <- sum(kmeans(mydata,
centers=i, iter.max = 20)$withinss)
plot(1:20, wss, type="b", xlab="Number of Clusters",
ylab="Within groups sum of squares", main = "Within groups sum of squares vs. Number of clusters")
# NY
ny_location <- cbind(ny_data$latitude, ny_data$longitude)
colnames(ny_location) <- c("latitude", "longitude")
mydata <- ny_location
wss <- (nrow(mydata)-1) * sum(apply(mydata,2,var))
for (i in 2:20) wss[i] <- sum(kmeans(mydata,
centers=i, iter.max = 20)$withinss)
plot(1:20, wss, type="b", xlab="Number of Clusters",
ylab="Within groups sum of squares", main = "Within groups sum of squares vs. Number of clusters")
According to the plots above, when k is greater than 15 the within-cluster sum of squares does not decrease much in magnitude, so we choose k = 15. We then run k-means on the data set for both cities and merge the cluster assignments into the original data.
library(ggplot2)
# SF
cl <- kmeans(sf_location, 15, iter.max = 20)
sf_location <- data.frame(sf_location, cl$cluster)
colnames(sf_location)[3] <- "cluster"
ggplot(sf_location, aes(longitude, latitude, col = as.character(cluster))) +
geom_point(alpha = 0.5) +
theme(legend.position="none")
## bind the cluster into the data
sf_data <- cbind(sf_data, sf_location$cluster)
colnames(sf_data)[21] <- "location_cluster"
sf_data <- sf_data %>%
mutate(location_cluster = as.character(location_cluster))
# NY
cl <- kmeans(ny_location, 15, iter.max = 20)
ny_location <- data.frame(ny_location, cl$cluster)
colnames(ny_location)[3] <- "cluster"
ggplot(ny_location, aes(longitude, latitude, col = as.character(cluster))) +
geom_point(alpha = 0.5) +
theme(legend.position="none")
## bind the cluster into the data
ny_data <- cbind(ny_data, ny_location$cluster)
colnames(ny_data)[21] <- "location_cluster"
ny_data <- ny_data %>%
mutate(location_cluster = as.character(location_cluster))
Here we want a general idea of how the outcome, last_sold_price, is distributed over the regions of the two cities. Since we saw earlier that last_sold_price needs a log transformation, we plot it on a log scale over the locations of the properties. Because there are too many properties in a relatively small area in both cities, we separate the properties into three groups according to their sold prices and plot each group separately.
library(ggmap)
#plot the log(last_sold_price) by location,
sf_data_new <- sf_data %>%
filter(!is.na(log_last_sold_price))
loghigh <-qmplot(longitude, latitude, data = sf_data_new %>%
filter(log_last_sold_price>14.5 & log_last_sold_price <19),
colour = log_last_sold_price) +
scale_colour_gradient(low = "red",
high="green",
guide= guide_legend(title = "log of last sold price \nfrom 14.5 to above 19 \n(more than $1983000)")) +
ggtitle("San Francisco Last Sold Price Graph I")
logmid <- qmplot(longitude, latitude, data = sf_data_new %>% filter(log_last_sold_price>13.5 & log_last_sold_price <=14.5), colour = log_last_sold_price)+scale_colour_gradient(low = "yellow",high="blue",na.value="white",guide= guide_legend(title = "log of last sold price \nfrom 13.5 to 14.5 \n($730000 to $1982999)")) + ggtitle("San Francisco Last Sold Price Graph II")
loglow <-qmplot(longitude, latitude, data = sf_data_new %>% filter(log_last_sold_price>=0 & log_last_sold_price <=13.5), colour = log_last_sold_price)+scale_colour_gradient(low = "black",high="lightblue3",na.value="white", guide= guide_legend(title = "log of last sold price \nfrom 0 to 13.5 \n($1 to $729999)"))+ ggtitle("San Francisco Last Sold Price Graph III")
loghigh
logmid
loglow
# Do the same thing for NY
# Plot the log(last_sold_price) by location
ny_data_new <- ny_data %>% filter(!is.na(log_last_sold_price))
loghigh <-qmplot(longitude, latitude, data = ny_data_new %>% filter(log_last_sold_price>14), colour = log_last_sold_price)+scale_colour_gradient(low = "red",high="green",guide= guide_legend(title = "log of last sold price \nfrom 14 \n(more than $1200000)")) + ggtitle("New York Last Sold Price Graph I")
logmid <- qmplot(longitude, latitude, data = ny_data_new %>% filter(log_last_sold_price>13 & log_last_sold_price <=14), colour = log_last_sold_price)+scale_colour_gradient(low = "yellow",high="blue",na.value="white",guide= guide_legend(title = "log of last sold price \nfrom 13 to 14 \n($442000 to $1199999)")) + ggtitle("New York Last Sold Price Graph II")
loglow <-qmplot(longitude, latitude, data = ny_data_new %>% filter(log_last_sold_price>=0 & log_last_sold_price <=13), colour = log_last_sold_price)+scale_colour_gradient(low = "black",high="lightblue3",na.value="white", guide= guide_legend(title = "log of last sold price \nfrom 0 to 13 \n(below $442000)"))+ ggtitle("New York Last Sold Price Graph III")
loghigh
logmid
loglow
From the figures above, we can easily see the prices of properties in different regions of the two cities. Properties in the northeast of SF have higher prices while those in the southwest have lower prices; meanwhile, properties in Manhattan clearly have higher prices, and prices in NY drop for properties around or far from Manhattan.
Next we examine the correlation between variables in the data set in the two cities.
# Evaluate the correlation between numerical values in SF data
cor_sf <- cor(sf_data %>%
filter(!is.na(log_property_size)) %>%
mutate(sold_date = as.numeric(last_sold_date)) %>%
dplyr::select(year_built, latitude, longitude, bathrooms, bedrooms, sold_date, log_tax_value, log_home_size, log_property_size))
library(reshape2)
melted_cor_sf <- melt(cor_sf)
ggplot(data = melted_cor_sf, aes(x=Var1, y=Var2, fill=value, label = value)) +
geom_tile() +
theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust = 1),
axis.title.x = element_blank(),
axis.title.y = element_blank()) +
scale_fill_gradient2(low = "white",
high = "red",
midpoint = 0,
limit = c(-1,1),
space = "Lab",
name="Pearson\nCorrelation") +
geom_text(aes(Var2, Var1, label = format(value, digits = 1)),
color = "black",
size = 3) +
labs(title = "Correlation of variables in SF data")
# Evaluate the correlation between the categorical variable and the numeric variables in SF
yr_home <-ggplot(data= sf_data, aes(factor(home_type), year_built)) +
geom_boxplot(aes(fill=home_type), alpha=0.7) +
ylab("year built") +
xlab("") +
theme(axis.line=element_blank(),
axis.text.x=element_blank())
loca_home <- ggplot(data= sf_data,
aes(factor(home_type), as.numeric(location_cluster))) +
geom_boxplot(aes(fill=home_type), alpha=0.7) +
ylab("location cluster") +
xlab("") +
theme(axis.line=element_blank(), axis.text.x = element_blank())
bath_home <- ggplot(data= sf_data, aes(factor(home_type), bathrooms)) +
geom_boxplot(aes(fill=home_type), alpha=0.7) +
xlab("home type") +
theme(axis.line=element_blank(), axis.text.x=element_blank())
bed_home <- ggplot(data = sf_data, aes(factor(home_type), bedrooms)) +
geom_boxplot(aes(fill=home_type), alpha=0.7) +
xlab("home type") +
theme(axis.line=element_blank(), axis.text.x=element_blank())
date_home <- ggplot(data = sf_data, aes(factor(home_type), last_sold_date)) +
geom_boxplot(aes(fill=home_type), alpha=0.7) +
ylab("last sold date") +
xlab("") +
theme(axis.line=element_blank(), axis.text.x=element_blank()) +
theme(legend.position = "none")
tax_home <- ggplot(data= sf_data, aes(factor(home_type), log_tax_value)) +
geom_boxplot(aes(fill=home_type), alpha=0.7) +
ylab("log tax value") +
xlab("") +
theme(axis.line=element_blank(),axis.text.x=element_blank()) +
theme(legend.position = "none")
size_home <- ggplot(data= sf_data, aes(factor(home_type), log_home_size)) + geom_boxplot(aes(fill=home_type), alpha=0.7) + ylab("log home size")+xlab("home type") +theme(axis.line=element_blank(),axis.text.x=element_blank())+theme(legend.position = "none")
property_home <- ggplot(data= sf_data %>% filter(!is.na(log_property_size)), aes(factor(home_type), log_property_size)) + geom_boxplot(aes(fill=home_type), alpha=0.7) + ylab("log property size")+xlab("home type") +theme(axis.line=element_blank(),axis.text.x=element_blank())+theme(legend.position = "none")
# Make those plots share the legend to save space
grid_arrange_shared_legend <- function(...) {
plots <- list(...)
g <- ggplotGrob(plots[[1]] + theme(legend.position="bottom"))$grobs
legend <- g[[which(sapply(g, function(x) x$name) == "guide-box")]]
lheight <- sum(legend$height)
grid.arrange(
do.call(arrangeGrob, lapply(plots, function(x)
x + theme(legend.position="none"))),
legend,
ncol = 1,
heights = unit.c(unit(1, "npc") - lheight, lheight))
}
grid_arrange_shared_legend(date_home, tax_home, size_home, property_home)
grid_arrange_shared_legend(yr_home, loca_home, bath_home, bed_home)
# Do the same thing for NY
# Evaluate the correlation between numerical values in NY data
cor_ny <- cor(ny_data %>%
filter(!is.na(log_property_size)) %>%
mutate(sold_date = as.numeric(last_sold_date)) %>%
dplyr::select(year_built, latitude, longitude, bathrooms, bedrooms, sold_date, log_tax_value, log_home_size, log_property_size))
library(reshape2)
melted_cor_ny <- melt(cor_ny)
ggplot(data = melted_cor_ny, aes(x=Var1, y=Var2, fill=value, label = value)) +
geom_tile() +
theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust = 1),
axis.title.x = element_blank(),
axis.title.y = element_blank()) +
scale_fill_gradient2(low = "white",
high = "red",
midpoint = 0,
limit = c(-1,1),
space = "Lab",
name="Pearson\nCorrelation") +
geom_text(aes(Var2, Var1, label = format(value, digits = 1)),
color = "black",
size = 3) +
labs(title = "Correlation of variables in NY data")
# Evaluate the correlation between the categorical variable and the numeric variables in NY
yr_home <-ggplot(data= ny_data, aes(factor(home_type), year_built)) + geom_boxplot(aes(fill=home_type), alpha=0.7) +ylab("year built") +xlab("")+theme(axis.line=element_blank(),axis.text.x=element_blank())
loca_home <- ggplot(data= ny_data, aes(factor(home_type), as.numeric(location_cluster))) + geom_boxplot(aes(fill=home_type), alpha=0.7) + ylab("location cluster")+xlab("")+theme(axis.line=element_blank(),axis.text.x=element_blank())
bath_home <- ggplot(data= ny_data, aes(factor(home_type), bathrooms)) + geom_boxplot(aes(fill=home_type), alpha=0.7) +xlab("home type") +theme(axis.line=element_blank(),axis.text.x=element_blank())
bed_home <- ggplot(data= ny_data, aes(factor(home_type), bedrooms)) + geom_boxplot(aes(fill=home_type), alpha=0.7) +xlab("home type") +theme(axis.line=element_blank(),axis.text.x=element_blank())
date_home <- ggplot(data= ny_data, aes(factor(home_type), last_sold_date)) + geom_boxplot(aes(fill=home_type), alpha=0.7) +ylab("last sold date") +xlab("")+theme(axis.line=element_blank(),axis.text.x=element_blank())+theme(legend.position = "none")
tax_home <- ggplot(data= ny_data, aes(factor(home_type), log_tax_value)) + geom_boxplot(aes(fill=home_type), alpha=0.7) +ylab("log tax value") +xlab("")+theme(axis.line=element_blank(),axis.text.x=element_blank())+theme(legend.position = "none")
size_home <- ggplot(data= ny_data, aes(factor(home_type), log_home_size)) + geom_boxplot(aes(fill=home_type), alpha=0.7) + ylab("log home size") +
theme(axis.line=element_blank(),axis.text.x=element_blank())+theme(legend.position = "none")
property_home <- ggplot(data= ny_data %>% filter(!is.na(log_property_size)), aes(factor(home_type), log_property_size)) + geom_boxplot(aes(fill=home_type), alpha=0.7) + ylab("log property size") +theme(axis.line=element_blank(),axis.text.x=element_blank())+theme(legend.position = "none")
grid_arrange_shared_legend(date_home, tax_home, size_home, property_home)
grid_arrange_shared_legend(yr_home, loca_home, bath_home, bed_home)
From the heatmaps of the two cities, we can see that the correlations between the numeric variables in the data set are quite small. Additionally, we do not observe any obvious trend in the figures when examining the relationship between the numeric and categorical variables. Therefore, we conclude that there is no strong correlation among the predictors in the model.
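For reference, the numeric correlations behind such a heatmap can be computed in a few lines. This is a minimal sketch, not the exact chunk used above: the selected column names are assumed from the variables shown in this report, and `ny_data` is assumed to be loaded.

```r
library(dplyr)
library(ggplot2)
library(reshape2)  # for melt()

# correlation matrix of the numeric predictors (pairwise complete observations)
num_cors <- ny_data %>%
  select(year_built, bathrooms, bedrooms, log_home_size, log_tax_value) %>%
  cor(use = "pairwise.complete.obs")

# long format for plotting as a tile heatmap with correlation labels
melt(num_cors) %>%
  ggplot(aes(Var1, Var2, fill = value)) +
  geom_tile() +
  geom_text(aes(label = round(value, 2)), size = 3) +
  labs(title = "Correlation of variables in NY data")
```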
First, we remove the obvious outliers (identified by common sense) from the data set, since they would otherwise hurt performance during the model-building stage. The filters below remove them:
sf_data_impute <- read_csv("data/sflisting.csv")
ny_data_impute <- read_csv("data/nylisting.csv")
sf_data_impute <- sf_data_impute %>%
mutate(property_size = property.size) %>%
select(-property.size) %>%
filter(!is.na(zipcode)) %>%
filter(bedrooms != 999 | is.na(bedrooms)) %>%
filter(!is.na(last_sold_price) & last_sold_price <= 3.5e7 & last_sold_price >= 1e5) %>%
filter(is.na(home_size) | (home_size > 20 & home_size < 10^5)) %>%
distinct()
ny_data_impute <- ny_data_impute %>%
mutate(property_size = property.size) %>%
select(-property.size) %>%
filter(bedrooms != 255 | is.na(bedrooms)) %>%
filter(!is.na(last_sold_price) & last_sold_price <= 5.8e7 & last_sold_price >= 1e5) %>%
filter(is.na(home_size) | (home_size > 20 & home_size < 10^5)) %>%
distinct()
We cut each continuous variable into 10 quantile-based categories so that the properties can be grouped for later imputation.
sf_data_impute <- sf_data_impute %>%
mutate(last_sold_price_q10 = cut(last_sold_price, breaks = as.numeric(quantile(last_sold_price, (0:10)/10, na.rm = T)), labels = c(1:10), include.lowest = T))
sf_data_impute <- sf_data_impute %>%
mutate(home_size_q10 = cut(home_size, breaks = as.numeric(quantile(home_size, (0:10)/10, na.rm = T)), labels = c(1:10), include.lowest = T))
ny_data_impute <- ny_data_impute %>%
mutate(last_sold_price_q10 = cut(last_sold_price, breaks = as.numeric(quantile(last_sold_price, (0:10)/10, na.rm = T)), labels = c(1:10), include.lowest = T))
ny_data_impute <- ny_data_impute %>%
mutate(home_size_q10 = cut(home_size, breaks = as.numeric(quantile(home_size, (0:10)/10, na.rm = T)), labels = c(1:10), include.lowest = T))
We impute the missing values of each variable with the median of that variable within the group the property belongs to, where the group is formed from the values of the property's other variables (for continuous variables, we use the categories generated above). When the variable is entirely missing within a group, we drop one of the grouping variables, regroup the properties and impute again. We iterate this process until all missing values in the data set are imputed.
# year_built imputed according to zipcode, home_type, home_size, last_sold_price (decreasing importance)
sf_data_impute <- sf_data_impute %>%
group_by(zipcode, home_type,last_sold_price_q10, home_size_q10) %>%
mutate(mean_year_built = round(median(year_built, na.rm = T))) %>%
ungroup %>%
mutate(year_built = ifelse(is.na(year_built), mean_year_built, year_built)) %>%
group_by(zipcode, home_type, home_size_q10) %>%
mutate(mean_year_built = round(median(year_built, na.rm = T))) %>%
ungroup %>%
mutate(year_built = ifelse(is.na(year_built), mean_year_built, year_built)) %>%
group_by(zipcode, home_type) %>%
mutate(mean_year_built = round(median(year_built, na.rm = T))) %>%
ungroup %>%
mutate(year_built = ifelse(is.na(year_built), mean_year_built, year_built)) %>%
group_by(zipcode) %>%
mutate(mean_year_built = round(median(year_built, na.rm = T))) %>%
ungroup %>%
mutate(year_built = ifelse(is.na(year_built), mean_year_built, year_built)) %>%
group_by(home_type) %>%
mutate(mean_year_built = round(median(year_built, na.rm = T))) %>%
ungroup %>%
mutate(year_built = ifelse(is.na(year_built), mean_year_built, year_built)) %>%
dplyr::select(-mean_year_built)
sf_data_impute <- sf_data_impute %>%
mutate(year_built_q10 = cut(year_built, breaks = as.numeric(quantile(year_built, (0:10)/10, na.rm = T)), labels = c(1:10), include.lowest = T))
# bathrooms imputed according to bedrooms, home_size, home_type, last_sold_price, year_built, zipcode (decreasing importance)
sf_data_impute <- sf_data_impute %>%
group_by(bedrooms, home_size_q10, home_type, last_sold_price_q10, year_built_q10, zipcode) %>%
mutate(median_bathrooms = median(bathrooms, na.rm = T)) %>%
ungroup %>%
mutate(bathrooms = ifelse(is.na(bathrooms), median_bathrooms, bathrooms)) %>%
group_by(bedrooms, home_size_q10, home_type, last_sold_price_q10, zipcode) %>%
mutate(median_bathrooms = median(bathrooms, na.rm = T)) %>%
ungroup %>%
mutate(bathrooms = ifelse(is.na(bathrooms), median_bathrooms, bathrooms)) %>%
group_by(bedrooms, home_size_q10, home_type, zipcode) %>%
mutate(median_bathrooms = median(bathrooms, na.rm = T)) %>%
ungroup %>%
mutate(bathrooms = ifelse(is.na(bathrooms), median_bathrooms, bathrooms)) %>%
group_by(bedrooms, home_type, zipcode) %>%
mutate(median_bathrooms = median(bathrooms, na.rm = T)) %>%
ungroup %>%
mutate(bathrooms = ifelse(is.na(bathrooms), median_bathrooms, bathrooms)) %>%
group_by(home_type, zipcode) %>%
mutate(median_bathrooms = median(bathrooms, na.rm = T)) %>%
ungroup %>%
mutate(bathrooms = ifelse(is.na(bathrooms), median_bathrooms, bathrooms)) %>%
group_by(home_type) %>%
mutate(median_bathrooms = median(bathrooms, na.rm = T)) %>%
ungroup %>%
mutate(bathrooms = ifelse(is.na(bathrooms), median_bathrooms, bathrooms)) %>%
dplyr::select(-median_bathrooms)
sf_data_impute <- sf_data_impute %>%
mutate(bathrooms_int = round(bathrooms))
# bedrooms imputed according to bathrooms, home_size, home_type, last_sold_price, year_built, zipcode (decreasing importance)
sf_data_impute <- sf_data_impute %>%
group_by(bathrooms_int, home_size_q10, home_type, last_sold_price_q10, year_built_q10, zipcode) %>%
mutate(median_bedrooms = median(as.numeric(bedrooms), na.rm = T)) %>%
ungroup %>%
mutate(bedrooms = ifelse(is.na(bedrooms), median_bedrooms, bedrooms)) %>%
group_by(bathrooms_int, home_size_q10, home_type, last_sold_price_q10, year_built_q10) %>%
mutate(median_bedrooms = median(as.numeric(bedrooms), na.rm = T)) %>%
ungroup %>%
mutate(bedrooms = ifelse(is.na(bedrooms), median_bedrooms, bedrooms)) %>%
group_by(bathrooms_int, home_size_q10, home_type, last_sold_price_q10) %>%
mutate(median_bedrooms = median(as.numeric(bedrooms), na.rm = T)) %>%
ungroup %>%
mutate(bedrooms = ifelse(is.na(bedrooms), median_bedrooms, bedrooms)) %>%
group_by(bathrooms_int, home_size_q10, home_type) %>%
mutate(median_bedrooms = median(as.numeric(bedrooms), na.rm = T)) %>%
ungroup %>%
mutate(bedrooms = ifelse(is.na(bedrooms), median_bedrooms, bedrooms)) %>%
group_by(bathrooms_int, home_size_q10) %>%
mutate(median_bedrooms = median(as.numeric(bedrooms), na.rm = T)) %>%
ungroup %>%
mutate(bedrooms = ifelse(is.na(bedrooms), median_bedrooms, bedrooms)) %>%
group_by(bathrooms_int) %>%
mutate(median_bedrooms = median(as.numeric(bedrooms), na.rm = T)) %>%
ungroup %>%
mutate(bedrooms = ifelse(is.na(bedrooms), median_bedrooms, bedrooms)) %>%
group_by(home_size_q10) %>%
mutate(median_bedrooms = median(as.numeric(bedrooms), na.rm = T)) %>%
ungroup %>%
mutate(bedrooms = ifelse(is.na(bedrooms), median_bedrooms, bedrooms)) %>%
dplyr::select(-median_bedrooms)
# tax_year imputed by mode of tax_year
getmode <- function(v) {
uniqv <- unique(v)
uniqv[which.max(tabulate(match(v, uniqv)))]
}
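As a quick illustration of the helper above: `getmode` returns the most frequent value, and ties go to the value encountered first.

```r
getmode(c(2015, 2016, 2016, 2017))  # returns 2016
getmode(c("A", "B"))                # tie: returns "A", the first value seen
```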
sf_data_impute <- sf_data_impute %>%
mutate(tax_year = ifelse(is.na(tax_year), getmode(tax_year), tax_year))
# tax_value imputed by last_sold_price, tax_year, home_size, home_type, zipcode, year_built
sf_data_impute <- sf_data_impute %>%
group_by(last_sold_price_q10,tax_year, home_size_q10, home_type, zipcode, year_built_q10) %>%
mutate(median_tax_value = median(tax_value, na.rm = T)) %>%
ungroup %>%
mutate(tax_value = ifelse(is.na(tax_value), median_tax_value, tax_value)) %>%
group_by(last_sold_price_q10,tax_year, home_size_q10, home_type, zipcode) %>%
mutate(median_tax_value = median(tax_value, na.rm = T)) %>%
ungroup %>%
mutate(tax_value = ifelse(is.na(tax_value), median_tax_value, tax_value)) %>%
group_by(last_sold_price_q10,tax_year, home_size_q10, home_type) %>%
mutate(median_tax_value = median(tax_value, na.rm = T)) %>%
ungroup %>%
mutate(tax_value = ifelse(is.na(tax_value), median_tax_value, tax_value)) %>%
group_by(last_sold_price_q10,tax_year, home_size_q10) %>%
mutate(median_tax_value = median(tax_value, na.rm = T)) %>%
ungroup %>%
mutate(tax_value = ifelse(is.na(tax_value), median_tax_value, tax_value)) %>%
group_by(last_sold_price_q10,tax_year) %>%
mutate(median_tax_value = median(tax_value, na.rm = T)) %>%
ungroup %>%
mutate(tax_value = ifelse(is.na(tax_value), median_tax_value, tax_value)) %>%
dplyr::select(-median_tax_value)
# home_size imputed by last_sold_price, bedrooms, bathrooms, home_type
sf_data_impute <- sf_data_impute %>%
mutate(bedrooms_int = round(bedrooms))
sf_data_impute <- sf_data_impute %>%
group_by(last_sold_price_q10, bedrooms_int, bathrooms_int, home_type) %>%
mutate(median_home_size = median(as.numeric(home_size), na.rm = T)) %>%
ungroup %>%
mutate(home_size = ifelse(is.na(home_size), median_home_size, home_size)) %>%
group_by(last_sold_price_q10, bedrooms_int, bathrooms_int) %>%
mutate(median_home_size = median(as.numeric(home_size), na.rm = T)) %>%
ungroup %>%
mutate(home_size = ifelse(is.na(home_size), median_home_size, home_size)) %>%
group_by(last_sold_price_q10, bedrooms_int) %>%
mutate(median_home_size = median(as.numeric(home_size), na.rm = T)) %>%
ungroup %>%
mutate(home_size = ifelse(is.na(home_size), median_home_size, home_size)) %>%
dplyr::select(-median_home_size)
# property_size imputed by home_size, last_sold_price, bedrooms, bathrooms, home_type
sf_data_impute <- sf_data_impute %>%
mutate(home_size_q10 = cut(home_size, breaks = as.numeric(quantile(home_size, (0:10)/10, na.rm = T)), labels = c(1:10), include.lowest = T))
sf_data_impute <- sf_data_impute %>%
group_by(home_size_q10, last_sold_price_q10, bedrooms_int, bathrooms_int, home_type) %>%
mutate(median_property_size = median(as.numeric(property_size), na.rm = T)) %>%
ungroup %>%
mutate(property_size = ifelse(is.na(property_size), median_property_size, property_size)) %>%
group_by(home_size_q10, last_sold_price_q10, bedrooms_int, bathrooms_int) %>%
mutate(median_property_size = median(as.numeric(property_size), na.rm = T)) %>%
ungroup %>%
mutate(property_size = ifelse(is.na(property_size), median_property_size, property_size)) %>%
group_by(home_size_q10, last_sold_price_q10, bedrooms_int) %>%
mutate(median_property_size = median(as.numeric(property_size), na.rm = T)) %>%
ungroup %>%
mutate(property_size = ifelse(is.na(property_size), median_property_size, property_size)) %>%
group_by(home_size_q10, last_sold_price_q10) %>%
mutate(median_property_size = median(as.numeric(property_size), na.rm = T)) %>%
ungroup %>%
mutate(property_size = ifelse(is.na(property_size), median_property_size, property_size)) %>%
dplyr::select(-median_property_size)
# Do the same thing for NY
# year_built imputed according to zipcode, home_type, home_size, last_sold_price (decreasing importance)
ny_data_impute <- ny_data_impute %>%
group_by(zipcode, home_type,last_sold_price_q10, home_size_q10) %>%
mutate(mean_year_built = round(median(year_built, na.rm = T))) %>%
ungroup %>%
mutate(year_built = ifelse(is.na(year_built), mean_year_built, year_built)) %>%
group_by(zipcode, home_type, home_size_q10) %>%
mutate(mean_year_built = round(median(year_built, na.rm = T))) %>%
ungroup %>%
mutate(year_built = ifelse(is.na(year_built), mean_year_built, year_built)) %>%
group_by(zipcode, home_type) %>%
mutate(mean_year_built = round(median(year_built, na.rm = T))) %>%
ungroup %>%
mutate(year_built = ifelse(is.na(year_built), mean_year_built, year_built)) %>%
group_by(zipcode) %>%
mutate(mean_year_built = round(median(year_built, na.rm = T))) %>%
ungroup %>%
mutate(year_built = ifelse(is.na(year_built), mean_year_built, year_built)) %>%
group_by(home_type) %>%
mutate(mean_year_built = round(median(year_built, na.rm = T))) %>%
ungroup %>%
mutate(year_built = ifelse(is.na(year_built), mean_year_built, year_built)) %>%
select(-mean_year_built)
ny_data_impute <- ny_data_impute %>%
mutate(year_built_q10 = cut(year_built, breaks = as.numeric(quantile(year_built, (0:10)/10, na.rm = T)), labels = c(1:10), include.lowest = T))
# bathrooms imputed according to bedrooms, home_size, home_type, last_sold_price, year_built, zipcode (decreasing importance)
ny_data_impute <- ny_data_impute %>%
group_by(bedrooms, home_size_q10, home_type, last_sold_price_q10, year_built_q10, zipcode) %>%
mutate(median_bathrooms = median(bathrooms, na.rm = T)) %>%
ungroup %>%
mutate(bathrooms = ifelse(is.na(bathrooms), median_bathrooms, bathrooms)) %>%
group_by(bedrooms, home_size_q10, home_type, last_sold_price_q10, zipcode) %>%
mutate(median_bathrooms = median(bathrooms, na.rm = T)) %>%
ungroup %>%
mutate(bathrooms = ifelse(is.na(bathrooms), median_bathrooms, bathrooms)) %>%
group_by(bedrooms, home_size_q10, home_type, zipcode) %>%
mutate(median_bathrooms = median(bathrooms, na.rm = T)) %>%
ungroup %>%
mutate(bathrooms = ifelse(is.na(bathrooms), median_bathrooms, bathrooms)) %>%
group_by(bedrooms, home_type, zipcode) %>%
mutate(median_bathrooms = median(bathrooms, na.rm = T)) %>%
ungroup %>%
mutate(bathrooms = ifelse(is.na(bathrooms), median_bathrooms, bathrooms)) %>%
group_by(home_type, zipcode) %>%
mutate(median_bathrooms = median(bathrooms, na.rm = T)) %>%
ungroup %>%
mutate(bathrooms = ifelse(is.na(bathrooms), median_bathrooms, bathrooms)) %>%
group_by(home_type) %>%
mutate(median_bathrooms = median(bathrooms, na.rm = T)) %>%
ungroup %>%
mutate(bathrooms = ifelse(is.na(bathrooms), median_bathrooms, bathrooms)) %>%
select(-median_bathrooms)
ny_data_impute <- ny_data_impute %>%
mutate(bathrooms_int = round(bathrooms))
# bedrooms imputed according to bathrooms, home_size, home_type, last_sold_price, year_built, zipcode (decreasing importance)
ny_data_impute <- ny_data_impute %>%
group_by(bathrooms_int, home_size_q10, home_type, last_sold_price_q10, year_built_q10, zipcode) %>%
mutate(median_bedrooms = median(as.numeric(bedrooms), na.rm = T)) %>%
ungroup %>%
mutate(bedrooms = ifelse(is.na(bedrooms), median_bedrooms, bedrooms)) %>%
group_by(bathrooms_int, home_size_q10, home_type, last_sold_price_q10, year_built_q10) %>%
mutate(median_bedrooms = median(as.numeric(bedrooms), na.rm = T)) %>%
ungroup %>%
mutate(bedrooms = ifelse(is.na(bedrooms), median_bedrooms, bedrooms)) %>%
group_by(bathrooms_int, home_size_q10, home_type, last_sold_price_q10) %>%
mutate(median_bedrooms = median(as.numeric(bedrooms), na.rm = T)) %>%
ungroup %>%
mutate(bedrooms = ifelse(is.na(bedrooms), median_bedrooms, bedrooms)) %>%
group_by(bathrooms_int, home_size_q10, home_type) %>%
mutate(median_bedrooms = median(as.numeric(bedrooms), na.rm = T)) %>%
ungroup %>%
mutate(bedrooms = ifelse(is.na(bedrooms), median_bedrooms, bedrooms)) %>%
group_by(bathrooms_int, home_size_q10) %>%
mutate(median_bedrooms = median(as.numeric(bedrooms), na.rm = T)) %>%
ungroup %>%
mutate(bedrooms = ifelse(is.na(bedrooms), median_bedrooms, bedrooms)) %>%
group_by(bathrooms_int) %>%
mutate(median_bedrooms = median(as.numeric(bedrooms), na.rm = T)) %>%
ungroup %>%
mutate(bedrooms = ifelse(is.na(bedrooms), median_bedrooms, bedrooms)) %>%
select(-median_bedrooms)
# tax_year imputed by mode of tax_year
ny_data_impute <- ny_data_impute %>%
mutate(tax_year = ifelse(is.na(tax_year), getmode(tax_year), tax_year))
# tax_value imputed by last_sold_price, tax_year, home_size, home_type, zipcode, year_built
ny_data_impute <- ny_data_impute %>%
group_by(last_sold_price_q10,tax_year, home_size_q10, home_type, zipcode, year_built_q10) %>%
mutate(median_tax_value = median(tax_value, na.rm = T)) %>%
ungroup %>%
mutate(tax_value = ifelse(is.na(tax_value), median_tax_value, tax_value)) %>%
group_by(last_sold_price_q10,tax_year, home_size_q10, home_type, zipcode) %>%
mutate(median_tax_value = median(tax_value, na.rm = T)) %>%
ungroup %>%
mutate(tax_value = ifelse(is.na(tax_value), median_tax_value, tax_value)) %>%
group_by(last_sold_price_q10,tax_year, home_size_q10, home_type) %>%
mutate(median_tax_value = median(tax_value, na.rm = T)) %>%
ungroup %>%
mutate(tax_value = ifelse(is.na(tax_value), median_tax_value, tax_value)) %>%
group_by(last_sold_price_q10,tax_year, home_size_q10) %>%
mutate(median_tax_value = median(tax_value, na.rm = T)) %>%
ungroup %>%
mutate(tax_value = ifelse(is.na(tax_value), median_tax_value, tax_value)) %>%
group_by(last_sold_price_q10,tax_year) %>%
mutate(median_tax_value = median(tax_value, na.rm = T)) %>%
ungroup %>%
mutate(tax_value = ifelse(is.na(tax_value), median_tax_value, tax_value)) %>%
select(-median_tax_value)
# home_size imputed by last_sold_price, bedrooms, bathrooms, home_type
ny_data_impute <- ny_data_impute %>%
mutate(bedrooms_int = round(bedrooms))
ny_data_impute <- ny_data_impute %>%
group_by(last_sold_price_q10, bedrooms_int, bathrooms_int, home_type) %>%
mutate(median_home_size = median(as.numeric(home_size), na.rm = T)) %>%
ungroup %>%
mutate(home_size = ifelse(is.na(home_size), median_home_size, home_size)) %>%
group_by(last_sold_price_q10, bedrooms_int, bathrooms_int) %>%
mutate(median_home_size = median(as.numeric(home_size), na.rm = T)) %>%
ungroup %>%
mutate(home_size = ifelse(is.na(home_size), median_home_size, home_size)) %>%
group_by(last_sold_price_q10, bedrooms_int) %>%
mutate(median_home_size = median(as.numeric(home_size), na.rm = T)) %>%
ungroup %>%
mutate(home_size = ifelse(is.na(home_size), median_home_size, home_size)) %>%
select(-median_home_size)
# property_size imputed by home_size, last_sold_price, bedrooms, bathrooms, home_type
ny_data_impute <- ny_data_impute %>%
mutate(home_size_q10 = cut(home_size, breaks = as.numeric(quantile(home_size, (0:10)/10, na.rm = T)), labels = c(1:10), include.lowest = T))
ny_data_impute <- ny_data_impute %>%
group_by(home_size_q10, last_sold_price_q10, bedrooms_int, bathrooms_int, home_type) %>%
mutate(median_property_size = median(as.numeric(property_size), na.rm = T)) %>%
ungroup %>%
mutate(property_size = ifelse(is.na(property_size), median_property_size, property_size)) %>%
group_by(home_size_q10, last_sold_price_q10, bedrooms_int, bathrooms_int) %>%
mutate(median_property_size = median(as.numeric(property_size), na.rm = T)) %>%
ungroup %>%
mutate(property_size = ifelse(is.na(property_size), median_property_size, property_size)) %>%
group_by(home_size_q10, last_sold_price_q10, bedrooms_int) %>%
mutate(median_property_size = median(as.numeric(property_size), na.rm = T)) %>%
ungroup %>%
mutate(property_size = ifelse(is.na(property_size), median_property_size, property_size)) %>%
group_by(home_size_q10, last_sold_price_q10) %>%
mutate(median_property_size = median(as.numeric(property_size), na.rm = T)) %>%
ungroup %>%
mutate(property_size = ifelse(is.na(property_size), median_property_size, property_size)) %>%
select(-median_property_size)
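The repeated group-then-impute pattern above can be factored into a small helper. This is a sketch only, not code used to generate the results in this report: `impute_by_groups()` is a hypothetical function, and it coarsens the grouping by dropping one variable at a time from the right, whereas the code above chooses which variable to drop at each step by importance.

```r
library(dplyr)
library(rlang)

# Impute NAs in `var` with the group median, coarsening the grouping one
# variable at a time (dropped from the right) until no NAs remain.
impute_by_groups <- function(data, var, group_vars) {
  var <- ensym(var)
  for (k in seq(length(group_vars), 0)) {
    if (!anyNA(pull(data, !!var))) break
    data <- data %>%
      group_by(across(all_of(group_vars[seq_len(k)]))) %>%
      mutate(!!var := ifelse(is.na(!!var),
                             median(!!var, na.rm = TRUE), !!var)) %>%
      ungroup()
  }
  data
}

# e.g. the bathrooms imputation above could then be written as:
# sf_data_impute <- impute_by_groups(sf_data_impute, bathrooms,
#   c("bedrooms", "home_size_q10", "home_type", "last_sold_price_q10",
#     "year_built_q10", "zipcode"))
```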
We redo the k-means clustering since the imputed data set is now larger than the previous complete-case data set. First, we need to find the optimal k.
# SF
sf_location <- cbind(sf_data_impute$latitude, sf_data_impute$longitude)
colnames(sf_location) <- c("latitude", "longitude")
mydata <- sf_location
wss <- (nrow(mydata)-1) * sum(apply(mydata,2,var))
for (i in 2:30) wss[i] <- sum(kmeans(mydata,
centers=i, iter.max = 20)$withinss)
plot(1:30, wss, type="b", xlab="Number of Clusters",
ylab="Within groups sum of squares", main = "Within groups sum of squares vs. Number of clusters")
# NY
ny_location <- cbind(ny_data_impute$latitude, ny_data_impute$longitude)
colnames(ny_location) <- c("latitude", "longitude")
mydata <- ny_location
wss <- (nrow(mydata)-1) * sum(apply(mydata,2,var))
for (i in 2:30) wss[i] <- sum(kmeans(mydata,
centers=i, iter.max = 20)$withinss)
plot(1:30, wss, type="b", xlab="Number of Clusters",
ylab="Within groups sum of squares", main = "Within groups sum of squares vs. Number of clusters")
According to the plots above, the within-cluster sum of squares decreases only marginally once k exceeds 25, so we choose k = 25. We then run k-means on the data sets for both cities and merge the cluster assignments back into the original data.
# SF
cl <- kmeans(sf_location, 25, iter.max = 20)
sf_location <- data.frame(sf_location, cl$cluster)
colnames(sf_location)[3] <- "cluster"
ggplot(sf_location, aes(longitude, latitude, col = as.character(cluster))) +
geom_point(alpha = 0.5) +
theme(legend.position="none")
## bind the cluster into the data
sf_data_impute <- cbind(sf_data_impute, sf_location$cluster)
colnames(sf_data_impute)[22] <- "location_cluster"
sf_data_impute <- sf_data_impute %>%
mutate(location_cluster = as.character(location_cluster))
# NY
cl <- kmeans(ny_location, 25, iter.max = 20)
ny_location <- data.frame(ny_location, cl$cluster)
colnames(ny_location)[3] <- "cluster"
ggplot(ny_location, aes(longitude, latitude, col = as.character(cluster))) +
geom_point(alpha = 0.5) +
theme(legend.position="none")
## bind the cluster into the data
ny_data_impute <- cbind(ny_data_impute, ny_location$cluster)
colnames(ny_data_impute)[22] <- "location_cluster"
ny_data_impute <- ny_data_impute %>%
mutate(location_cluster = as.character(location_cluster))
We apply linear regression, random forests and a neural network to the data set. Since the neural network produces the best prediction results, we use the neural network model to predict house prices in the two cities.
For both the complete and the imputed data sets, the data are divided into a training set and a test set. We apply a forward selection algorithm to build the linear regression model. The inclusion criterion for a predictor is set up as:
For the linear regression model, we first check whether a transformed outcome or transformed predictors produce better results, and then add quadratic, two-way interaction and higher-order terms to arrive at the final model.
We write a function to calculate the RMSE of different models.
RMSE <- function(true_ratings, predicted_ratings){
sqrt(mean((predicted_ratings - true_ratings)^2))
}
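A quick sanity check of the helper:

```r
RMSE(c(100, 200, 300), c(110, 190, 300))
# sqrt((10^2 + 10^2 + 0^2) / 3) ≈ 8.165
```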
library(caret)
set.seed(1)
inTrain_sf <- createDataPartition(y=sf_data$last_sold_price, p=0.9)
sf_data_train <- slice(sf_data, inTrain_sf$Resample1)
sf_data_test <- slice(sf_data, -inTrain_sf$Resample1)
Following the process described at the beginning of this part, we fit the model with a log-transformed outcome and compare the result with the untransformed outcome.
# fit model with log_transformed y
lm1 <- lm(log_last_sold_price ~ location_cluster +
            year_built +
            bathrooms +
            bedrooms +
            last_sold_date +
            home_size +
            home_type +
            tax_year +
            tax_value +
            as.character(zipcode),
          data = sf_data_train)
#model <- gsub("lm", "", lm1$call, perl=TRUE)[2]
model <- "Model with log transformed y and untransformed predictors"
R_square <- summary(lm1)$r.squared
adj_R_square <- summary(lm1)$adj.r.squared
# note: predict(lm1) is on the log scale while last_sold_price is not, so
# these RMSEs overstate the log model's error; exp(predict(lm1)) would be
# needed for values directly comparable to the later models
RMSE_train <- RMSE(sf_data_train$last_sold_price, predict(lm1))
RMSE_test <- RMSE(sf_data_test$last_sold_price, predict(lm1, sf_data_test))
Model_results_sf <- data.frame(Model = model, R_square = R_square, adjusted_R_square = adj_R_square, RMSE_train = RMSE_train, RMSE_test = RMSE_test, p_lrtest_vs_last_model = NA)
Model_results_sf %>% kable
| Model | R_square | adjusted_R_square | RMSE_train | RMSE_test | p_lrtest_vs_last_model |
|---|---|---|---|---|---|
| Model with log transformed y and untransformed predictors | 0.6157989 | 0.6137994 | 1757180 | 1632475 | NA |
We center the year-related predictors in the model to see whether this produces a different result.
# center years
sf_data_train <- sf_data_train %>%
mutate(centered_year_built = year_built - median(year_built),
centered_tax_year = tax_year - median(tax_year))
sf_data_test <- sf_data_test %>%
mutate(centered_year_built = year_built - median(year_built),
centered_tax_year = tax_year - median(tax_year))
lm2 <- lm(last_sold_price ~ location_cluster +
centered_year_built +
bathrooms +
bedrooms +
last_sold_date +
home_size +
home_type +
centered_tax_year +
tax_value +
as.character(zipcode),
data = sf_data_train)
#model <- gsub("lm", "", lm2$call, perl=TRUE)[2]
model <- "Model with centered years"
R_square <- summary(lm2)$r.squared
adj_R_square <- summary(lm2)$adj.r.squared
RMSE_train <- RMSE(sf_data_train$last_sold_price, predict(lm2))
RMSE_test <- RMSE(sf_data_test$last_sold_price, predict(lm2, sf_data_test))
Model_results_sf <- bind_rows(Model_results_sf, data.frame(Model = model, R_square = R_square, adjusted_R_square = adj_R_square, RMSE_train = RMSE_train, RMSE_test = RMSE_test, p_lrtest_vs_last_model = NA))
## Warning in rbind_all(x, .id): Unequal factor levels: coercing to character
Model_results_sf %>% kable
| Model | R_square | adjusted_R_square | RMSE_train | RMSE_test | p_lrtest_vs_last_model |
|---|---|---|---|---|---|
| Model with log transformed y and untransformed predictors | 0.6157989 | 0.6137994 | 1757179.7 | 1632475 | NA |
| Model with centered years | 0.7157573 | 0.7142780 | 632535.3 | 527563 | NA |
# fit model with untransformed y
lm3 <- lm(last_sold_price ~ location_cluster +
year_built +
bathrooms +
bedrooms +
last_sold_date +
home_size +
home_type +
tax_year +
tax_value +
as.character(zipcode),
data = sf_data_train)
#model <- gsub("lm", "", lm3$call, perl=TRUE)[2]
model <- "Model with untransformed y and untransformed predictors"
R_square <- summary(lm3)$r.squared
adj_R_square <- summary(lm3)$adj.r.squared
RMSE_train <- RMSE(sf_data_train$last_sold_price, predict(lm3))
RMSE_test <- RMSE(sf_data_test$last_sold_price, predict(lm3, sf_data_test))
Model_results_sf <- bind_rows(Model_results_sf, data.frame(Model = model, R_square = R_square, adjusted_R_square = adj_R_square, RMSE_train = RMSE_train, RMSE_test = RMSE_test, p_lrtest_vs_last_model = NA))
Model_results_sf %>% kable
| Model | R_square | adjusted_R_square | RMSE_train | RMSE_test | p_lrtest_vs_last_model |
|---|---|---|---|---|---|
| Model with log transformed y and untransformed predictors | 0.6157989 | 0.6137994 | 1757179.7 | 1632475.3 | NA |
| Model with centered years | 0.7157573 | 0.7142780 | 632535.3 | 527563.0 | NA |
| Model with untransformed y and untransformed predictors | 0.7157573 | 0.7142780 | 632535.3 | 527583.3 | NA |
From the table above, we see that the log-transformed outcome does not improve the performance of our model, and the centered year variables produce essentially the same results as the uncentered ones. We then add quadratic terms of the predictors to the model; whether a given quadratic term is included is determined by the inclusion criterion listed at the beginning of this part. The model with the retained quadratic terms is presented below.
# Add quadratic terms
library(lmtest)
## Loading required package: zoo
##
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
##
## as.Date, as.Date.numeric
lm4 <- lm(last_sold_price ~ location_cluster +
year_built +
bathrooms + I(bathrooms^2) +
bedrooms +
last_sold_date + I(as.numeric(last_sold_date)^2) +
home_size +
home_type +
tax_year + I(tax_year^2) +
tax_value +
as.character(zipcode),
data = sf_data_train)
#model <- gsub("lm", "", lm4$call, perl=TRUE)[2]
model <- "Model with quadratic terms contributing to the prediction results"
R_square <- summary(lm4)$r.squared
adj_R_square <- summary(lm4)$adj.r.squared
RMSE_train <- RMSE(sf_data_train$last_sold_price, predict(lm4))
RMSE_test <- RMSE(sf_data_test$last_sold_price, predict(lm4, sf_data_test))
p_lrtest <- lrtest(lm3, lm4)$`Pr(>Chisq)`[2]
Model_results_sf <- bind_rows(Model_results_sf, data.frame(Model = model, R_square = R_square, adjusted_R_square = adj_R_square, RMSE_train = RMSE_train, RMSE_test = RMSE_test, p_lrtest_vs_last_model = p_lrtest))
Model_results_sf %>% kable
| Model | R_square | adjusted_R_square | RMSE_train | RMSE_test | p_lrtest_vs_last_model |
|---|---|---|---|---|---|
| Model with log transformed y and untransformed predictors | 0.6157989 | 0.6137994 | 1757179.7 | 1632475.3 | NA |
| Model with centered years | 0.7157573 | 0.7142780 | 632535.3 | 527563.0 | NA |
| Model with untransformed y and untransformed predictors | 0.7157573 | 0.7142780 | 632535.3 | 527583.3 | NA |
| Model with quadratic terms contributing to the prediction results | 0.7298357 | 0.7283525 | 616671.9 | 509283.3 | 0 |
Two-way interaction terms of the predictors are then added to the model, again screened by the inclusion criterion listed at the beginning of this part. The model with quadratic and two-way interaction terms is presented below.
# Add interactions
lm5 <- lm(last_sold_price ~ location_cluster +
year_built +
bathrooms + I(bathrooms^2) +
bedrooms +
last_sold_date + I(as.numeric(last_sold_date)^2) +
home_size +
home_type +
tax_year + I(tax_year^2) +
tax_value +
as.character(zipcode) +
year_built:last_sold_date + year_built:tax_year +
bathrooms:last_sold_date + bathrooms:home_size + bathrooms:tax_year +
bathrooms:tax_value +
bedrooms:tax_value +
last_sold_date:home_size + last_sold_date:tax_value +
home_size:tax_year +
tax_year:tax_value +
home_type:year_built + home_type:bathrooms + home_type:bedrooms +
home_type:last_sold_date + home_type:tax_year +
as.character(zipcode):bedrooms + as.character(zipcode):last_sold_date +
as.character(zipcode):home_size + as.character(zipcode):tax_year +
as.character(zipcode):tax_value,
data = sf_data_train)
#model <- gsub("lm", "", lm5$call, perl=TRUE)[2]
model <- "Model with quadratic and interaction terms contributing to the prediction results"
R_square <- summary(lm5)$r.squared
adj_R_square <- summary(lm5)$adj.r.squared
RMSE_train <- RMSE(sf_data_train$last_sold_price, predict(lm5))
RMSE_test <- RMSE(sf_data_test$last_sold_price, predict(lm5, sf_data_test))
## Warning in predict.lm(lm5, sf_data_test): prediction from a rank-deficient
## fit may be misleading
p_lrtest <- lrtest(lm4, lm5)$`Pr(>Chisq)`[2]
Model_results_sf <- bind_rows(Model_results_sf, data.frame(Model = model, R_square = R_square, adjusted_R_square = adj_R_square, RMSE_train = RMSE_train, RMSE_test = RMSE_test, p_lrtest_vs_last_model = p_lrtest))
Model_results_sf %>% kable
| Model | R_square | adjusted_R_square | RMSE_train | RMSE_test | p_lrtest_vs_last_model |
|---|---|---|---|---|---|
| Model with log transformed y and untransformed predictors | 0.6157989 | 0.6137994 | 1757179.7 | 1632475.3 | NA |
| Model with centered years | 0.7157573 | 0.7142780 | 632535.3 | 527563.0 | NA |
| Model with untransformed y and untransformed predictors | 0.7157573 | 0.7142780 | 632535.3 | 527583.3 | NA |
| Model with quadratic terms contributing to the prediction results | 0.7298357 | 0.7283525 | 616671.9 | 509283.3 | 0 |
| Model with quadratic and interaction terms contributing to the prediction results | 0.8023154 | 0.7980183 | 527505.0 | 497166.2 | 0 |
Higher-order terms of the predictors are then added to the model, again screened by the inclusion criterion listed at the beginning of this part. The model with quadratic, two-way interaction and higher-order terms is presented below.
# Add higher order terms
lm6 <- lm(last_sold_price ~ location_cluster +
year_built +
bathrooms + I(bathrooms^2) + I(bathrooms^3) +
bedrooms +
last_sold_date + I(as.numeric(last_sold_date)^2) +
home_size +
home_type +
tax_year + I(tax_year^2) +
tax_value +
as.character(zipcode) +
year_built:last_sold_date + year_built:tax_year +
bathrooms:last_sold_date + bathrooms:home_size + bathrooms:tax_year +
bathrooms:tax_value +
bedrooms:tax_value +
last_sold_date:home_size + last_sold_date:tax_value +
home_size:tax_year +
tax_year:tax_value +
home_type:year_built + home_type:bathrooms + home_type:bedrooms +
home_type:last_sold_date + home_type:tax_year +
as.character(zipcode):bedrooms + as.character(zipcode):last_sold_date +
as.character(zipcode):home_size + as.character(zipcode):tax_year +
as.character(zipcode):tax_value,
data = sf_data_train)
model <- "Model with quadratic, interaction and higher order terms contributing to the prediction results"
R_square <- summary(lm6)$r.squared
adj_R_square <- summary(lm6)$adj.r.squared
RMSE_train <- RMSE(sf_data_train$last_sold_price, predict(lm6))
RMSE_test <- RMSE(sf_data_test$last_sold_price, predict(lm6, sf_data_test))
## Warning in predict.lm(lm6, sf_data_test): prediction from a rank-deficient
## fit may be misleading
p_lrtest <- lrtest(lm5, lm6)$`Pr(>Chisq)`[2]
Model_results_sf <- bind_rows(Model_results_sf, data.frame(Model = model, R_square = R_square, adjusted_R_square = adj_R_square, RMSE_train = RMSE_train, RMSE_test = RMSE_test, p_lrtest_vs_last_model = p_lrtest))
Model_results_sf %>% kable
| Model | R_square | adjusted_R_square | RMSE_train | RMSE_test | p_lrtest_vs_last_model |
|---|---|---|---|---|---|
| Model with log transformed y and untransformed predictors | 0.6157989 | 0.6137994 | 1757179.7 | 1632475.3 | NA |
| Model with centered years | 0.7157573 | 0.7142780 | 632535.3 | 527563.0 | NA |
| Model with untransformed y and untransformed predictors | 0.7157573 | 0.7142780 | 632535.3 | 527583.3 | NA |
| Model with quadratic terms contributing to the prediction results | 0.7298357 | 0.7283525 | 616671.9 | 509283.3 | 0.0000000 |
| Model with quadratic and interaction terms contributing to the prediction results | 0.8023154 | 0.7980183 | 527505.0 | 497166.2 | 0.0000000 |
| Model with quadratic, interaction and higher order terms contributing to the prediction results | 0.8024080 | 0.7980935 | 527381.4 | 497195.4 | 0.0256473 |
Since the model would become very hard to interpret if we included three-way or higher-order interaction terms, we present this as the final linear regression model for San Francisco on the complete data set.
\[ \begin{align*} last\_sold\_price = &\beta_0 + \beta_1 location\_cluster + \beta_2 year\_built + \beta_3 bathrooms + \beta_4 bathrooms^2 + \beta_5 bathrooms^3 + \beta_6 bedrooms \\ &+ \beta_7 last\_sold\_date + \beta_8 last\_sold\_date^2 + \beta_9 home\_size + \beta_{10} home\_type + \beta_{11} tax\_year + \beta_{12} tax\_year^2 + \beta_{13} tax\_value \\ &+ \beta_{14} zipcode + \beta_{15} year\_built \times last\_sold\_date + \beta_{16} year\_built \times tax\_year + \beta_{17} bathrooms \times last\_sold\_date \\ &+ \beta_{18} bathrooms \times home\_size + \beta_{19} bathrooms\times tax\_year + \beta_{20} bathrooms \times tax\_value + \beta_{21} bedrooms \times tax\_value \\ &+ \beta_{22} last\_sold\_date\times home\_size + \beta_{23} last\_sold\_date\times tax\_value + \beta_{24}home\_size\times tax\_year + \beta_{25} tax\_year\times tax\_value\\ &+ \beta_{26}home\_type\times year\_built + \beta_{27}home\_type\times bathrooms + \beta_{28}home\_type\times bedrooms \\ &+\beta_{29}home\_type\times last\_sold\_date + \beta_{30} home\_type\times tax\_year + \beta_{31} zipcode \times bedrooms + \beta_{32} zipcode \times last\_sold\_date \\ &+\beta_{33} zipcode\times home\_size + \beta_{34} zipcode \times tax\_year +\beta_{35} zipcode\times tax\_value \end{align*} \]
Note that in the above model, we collapse the terms of each categorical variable (zipcode, location_cluster, home_type) into a single term for simplicity.
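One of our research questions asks how much a house with specific features would cost. As a sketch (not part of the analysis above), the fitted model lm6 can price a hypothetical listing; every field value below is invented for illustration, and the categorical fields (home_type, zipcode, location_cluster) must take levels that appear in the training data.

```r
# Sketch: pricing a hypothetical SF listing with the final model lm6.
# All field values are made up for illustration.
new_listing <- data.frame(location_cluster = sf_data_train$location_cluster[1],
                          year_built = 1925,
                          bathrooms = 2,
                          bedrooms = 3,
                          last_sold_date = as.Date("2016-06-01"),
                          home_size = 1500,            # square feet
                          home_type = "SingleFamily",
                          tax_year = 2015,
                          tax_value = 900000,
                          zipcode = 94110)
# Point estimate and 95% prediction interval for the sale price
predict(lm6, newdata = new_listing, interval = "prediction")
```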
Then we explore the variables of importance in the above model.
varimp_sf <- varImp(lm6, scale = T)
varimp_sf <- data.frame(variable = rownames(varimp_sf), varimp_sf)
rownames(varimp_sf) <- 1:length(varimp_sf$variable)
varimp_sf <- varimp_sf %>%
arrange(desc(Overall)) %>%
mutate(Overall = Overall / sum(varimp_sf$Overall) * 100) %>%
slice(1:20)
varimp_sf %>%
ggplot(aes(x=1:dim(.)[1], y = Overall)) +
theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust = 1),
axis.title.x = element_blank()) +
geom_line(color = "darkgrey") +
geom_point(color = "grey") +
scale_x_continuous(breaks=1:dim(varimp_sf)[1],
labels=varimp_sf$variable) +
labs(title = "Variable of Importance in San Francisco",
y = "Variance explained(%)")
From the above plot, we can see that last_sold_date, tax_value, home_size, some of the zipcode levels, and the interactions of bathrooms and of bedrooms with other variables are the variables that explain the most variance of the outcome in this model.
Then we do the same thing for New York.
set.seed(1)
inTrain_ny <- createDataPartition(y=ny_data$last_sold_price, p=0.9)
ny_data_train <- slice(ny_data, inTrain_ny$Resample1)
ny_data_test <- slice(ny_data, -inTrain_ny$Resample1)
According to the results from San Francisco, we first fit the model with untransformed outcome and predictors.
lm7 <- lm(last_sold_price ~ location_cluster +
year_built +
bathrooms +
bedrooms +
last_sold_date +
home_size +
home_type +
tax_year +
tax_value +
as.character(zipcode),
data = ny_data_train)
model <- "Model with untransformed y and untransformed predictors"
R_square <- summary(lm7)$r.squared
adj_R_square <- summary(lm7)$adj.r.squared
RMSE_train <- RMSE(ny_data_train$last_sold_price, predict(lm7))
RMSE_test <- RMSE(ny_data_test$last_sold_price, predict(lm7, ny_data_test))
## Warning in predict.lm(lm7, ny_data_test): prediction from a rank-deficient
## fit may be misleading
Model_results_ny <- data.frame(Model = model, R_square = R_square, adjusted_R_square = adj_R_square, RMSE_train = RMSE_train, RMSE_test = RMSE_test, p_lrtest_vs_last_model = NA)
Model_results_ny %>% kable
| Model | R_square | adjusted_R_square | RMSE_train | RMSE_test | p_lrtest_vs_last_model |
|---|---|---|---|---|---|
| Model with untransformed y and untransformed predictors | 0.3276206 | 0.2916758 | 1316364 | 967284.2 | NA |
We then add quadratic terms of the predictors to the model, and whether a certain quadratic term should be included in the model is determined by the inclusion criteria listed at the beginning of this part. The model with included quadratic terms is presented below.
# Add quadratic terms
lm8 <- lm(last_sold_price ~ location_cluster +
year_built +
bathrooms +
bedrooms +
last_sold_date +
home_size + I(home_size^2) +
home_type +
tax_year +
tax_value + I(tax_value^2) +
as.character(zipcode),
data = ny_data_train)
model <- "Model with quadratic terms contributing to the prediction results"
R_square <- summary(lm8)$r.squared
adj_R_square <- summary(lm8)$adj.r.squared
RMSE_train <- RMSE(ny_data_train$last_sold_price, predict(lm8))
RMSE_test <- RMSE(ny_data_test$last_sold_price, predict(lm8, ny_data_test))
## Warning in predict.lm(lm8, ny_data_test): prediction from a rank-deficient
## fit may be misleading
p_lrtest <- lrtest(lm7, lm8)$`Pr(>Chisq)`[2]
Model_results_ny <- bind_rows(Model_results_ny, data.frame(Model = model, R_square = R_square, adjusted_R_square = adj_R_square, RMSE_train = RMSE_train, RMSE_test = RMSE_test, p_lrtest_vs_last_model = p_lrtest))
## Warning in rbind_all(x, .id): Unequal factor levels: coercing to character
Model_results_ny %>% kable
| Model | R_square | adjusted_R_square | RMSE_train | RMSE_test | p_lrtest_vs_last_model |
|---|---|---|---|---|---|
| Model with untransformed y and untransformed predictors | 0.3276206 | 0.2916758 | 1316364 | 967284.2 | NA |
| Model with quadratic terms contributing to the prediction results | 0.3495884 | 0.3144587 | 1294682 | 963388.5 | 0 |
Two-way interaction terms of the predictors are added into the model, and whether a certain interaction term should be included in the model is determined by the inclusion criteria listed at the beginning of this part. The model with quadratic and two-way interaction terms is presented below.
# Add interactions
lm9 <- lm(last_sold_price ~ location_cluster +
year_built +
bathrooms + I(bathrooms^2) +
bedrooms +
last_sold_date + I(as.numeric(last_sold_date)^2) +
home_size +
home_type +
tax_year + I(tax_year^2) +
tax_value +
as.character(zipcode) +
year_built:home_size +
bathrooms:bedrooms + bathrooms:home_size +
home_size:tax_year +
tax_year:tax_value +
home_type:bathrooms + home_type:last_sold_date +
as.character(zipcode):bathrooms + as.character(zipcode):bedrooms,
data = ny_data_train)
model <- "Model with quadratic and interaction terms contributing to the prediction results"
R_square <- summary(lm9)$r.squared
adj_R_square <- summary(lm9)$adj.r.squared
RMSE_train <- RMSE(ny_data_train$last_sold_price, predict(lm9))
RMSE_test <- RMSE(ny_data_test$last_sold_price, predict(lm9, ny_data_test))
## Warning in predict.lm(lm9, ny_data_test): prediction from a rank-deficient
## fit may be misleading
p_lrtest <- lrtest(lm8, lm9)$`Pr(>Chisq)`[2]
Model_results_ny <- bind_rows(Model_results_ny, data.frame(Model = model, R_square = R_square, adjusted_R_square = adj_R_square, RMSE_train = RMSE_train, RMSE_test = RMSE_test, p_lrtest_vs_last_model = p_lrtest))
Model_results_ny %>% kable
| Model | R_square | adjusted_R_square | RMSE_train | RMSE_test | p_lrtest_vs_last_model |
|---|---|---|---|---|---|
| Model with untransformed y and untransformed predictors | 0.3276206 | 0.2916758 | 1316364 | 967284.2 | NA |
| Model with quadratic terms contributing to the prediction results | 0.3495884 | 0.3144587 | 1294682 | 963388.5 | 0 |
| Model with quadratic and interaction terms contributing to the prediction results | 0.5973371 | 0.5305380 | 1018684 | 863657.1 | 0 |
Higher order terms of the predictors are added into the model, and whether a certain higher order term should be included in the model is determined by the inclusion criteria listed at the beginning of this part. We found that none of the higher order terms are necessary in this model, so we present the final model for New York on the complete data set.
\[ \begin{align*} last\_sold\_price = &\beta_0 + \beta_1 location\_cluster + \beta_2 year\_built + \beta_3 bathrooms + \beta_4 bathrooms^2 + \beta_5 bedrooms + \beta_6 last\_sold\_date \\ &+ \beta_7 last\_sold\_date^2 + \beta_8 home\_size + \beta_{9} home\_type + \beta_{10} tax\_year + \beta_{11} tax\_year^2 + \beta_{12} tax\_value + \beta_{13} zipcode \\ &+ \beta_{14} year\_built \times home\_size + \beta_{15} bathrooms \times bedrooms + \beta_{16} bathrooms \times home\_size + \beta_{17} home\_size\times tax\_year \\ &+\beta_{18} tax\_year\times tax\_value + \beta_{19}home\_type \times bathrooms + \beta_{20}home\_type \times last\_sold\_date + \beta_{21} zipcode \times bathrooms\\ &+\beta_{22} zipcode \times bedrooms \\ \end{align*} \]
Note that in the above model, we collapse the terms of each categorical variable (zipcode, location_cluster, home_type) into a single term for simplicity.
Then we explore the variables of importance in the above model.
varimp_ny <- varImp(lm9, scale = T)
varimp_ny <- data.frame(variable = rownames(varimp_ny), varimp_ny)
rownames(varimp_ny) <- 1:length(varimp_ny$variable)
varimp_ny <- varimp_ny %>%
arrange(desc(Overall)) %>%
mutate(Overall = Overall / sum(varimp_ny$Overall) * 100) %>%
slice(1:20)
varimp_ny %>%
ggplot(aes(x=1:dim(.)[1], y = Overall)) +
theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust = 1),
axis.title.x = element_blank()) +
geom_line(color = "darkgrey") +
geom_point(color = "grey") +
scale_x_continuous(breaks=1:dim(varimp_ny)[1],
labels=varimp_ny$variable) +
labs(title = "Variable of Importance in New York",
y = "Variance explained(%)")
From the above plot, we can see that, unlike in San Francisco, where a single variable explained much of the variance in the outcome and the variance explained by the remaining variables dropped sharply, all the variables in this model explain a relatively small amount of the variance of the outcome. Among them, bathrooms, tax_year, some of the interactions between zipcode and other variables, and the interactions involving home_type are the variables that explain the most variance of the outcome in this model.
There is a substantial difference between this result and what we obtained for San Francisco, which may reflect differences in how property prices are determined in the two cities. In both cities, bathrooms, home_size and some of the zipcode levels play a large part among the variables of importance, which makes intuitive sense. We also see that home_type is important in New York while last_sold_date is important in San Francisco.
We apply the previous model building process to the imputed data set this time to see whether imputation could give us better results.
#write.csv(sf_data_impute, "sf_data_imputed.csv", row.names = F)
#write.csv(ny_data_impute, "ny_data_imputed.csv", row.names = F)
sf_data_impute <- sf_data_impute %>%
dplyr::select(year_built, bathrooms, bedrooms, last_sold_date, home_size, home_type, tax_year, tax_value, zipcode, location_cluster, last_sold_price)
ny_data_impute <- ny_data_impute %>%
dplyr::select(year_built, bathrooms, bedrooms, last_sold_date, home_size, home_type, tax_year, tax_value, zipcode, location_cluster, last_sold_price)
set.seed(1)
inTrain_sf <- createDataPartition(y=sf_data_impute$last_sold_price, p=0.9)
sf_data_impute_train <- slice(sf_data_impute, inTrain_sf$Resample1)
sf_data_impute_test <- slice(sf_data_impute, -inTrain_sf$Resample1)
We fit the model with untransformed outcome and predictors and the result is presented below.
lm10<- lm(last_sold_price ~ year_built +
bathrooms +
bedrooms +
last_sold_date +
home_size +
home_type +
tax_year +
tax_value +
as.character(zipcode) +
location_cluster,
data = sf_data_impute_train)
model <- "Model with untransformed y and imputed untransformed predictors"
R_square <- summary(lm10)$r.squared
adj_R_square <- summary(lm10)$adj.r.squared
RMSE_train <- RMSE(sf_data_impute_train$last_sold_price, predict(lm10, sf_data_impute_train))
sf_data_impute_test <- sf_data_impute_test %>%
filter(home_type %in% unique(sf_data_impute_train$home_type))
RMSE_test <- RMSE(sf_data_impute_test$last_sold_price, predict(lm10, sf_data_impute_test))
Model_results_imputed_sf <- data.frame(Model = model, R_square = R_square, adjusted_R_square = adj_R_square, RMSE_train = RMSE_train, RMSE_test = RMSE_test, p_lrtest_vs_last_model = NA)
Model_results_imputed_sf %>% kable
| Model | R_square | adjusted_R_square | RMSE_train | RMSE_test | p_lrtest_vs_last_model |
|---|---|---|---|---|---|
| Model with untransformed y and imputed untransformed predictors | 0.6830835 | 0.6814156 | 672897.5 | 723299.7 | NA |
We then add quadratic terms of the predictors to the model, and whether a certain quadratic term should be included in the model is determined by the inclusion criteria listed at the beginning of this part. The model with included quadratic terms is presented below.
# Add quadratic terms
lm11 <- lm(last_sold_price ~ year_built + I(year_built^2) +
bathrooms + I(bathrooms^2) +
bedrooms + I(bedrooms^2) +
last_sold_date + I(as.numeric(last_sold_date)^2) +
home_size +
home_type +
tax_year +
tax_value + I(tax_value^2) +
as.character(zipcode) +
location_cluster,
data = sf_data_impute_train)
model <- "Model with imputed untransformed predictors and quadratic terms"
R_square <- summary(lm11)$r.squared
adj_R_square <- summary(lm11)$adj.r.squared
RMSE_train <- RMSE(sf_data_impute_train$last_sold_price, predict(lm11, sf_data_impute_train))
RMSE_test <- RMSE(sf_data_impute_test$last_sold_price, predict(lm11, sf_data_impute_test))
p_lrtest <- lrtest(lm10, lm11)$`Pr(>Chisq)`[2]
Model_results_imputed_sf <- bind_rows(Model_results_imputed_sf, data.frame(Model = model, R_square = R_square, adjusted_R_square = adj_R_square, RMSE_train = RMSE_train, RMSE_test = RMSE_test, p_lrtest_vs_last_model = p_lrtest))
## Warning in rbind_all(x, .id): Unequal factor levels: coercing to character
Model_results_imputed_sf %>% kable
| Model | R_square | adjusted_R_square | RMSE_train | RMSE_test | p_lrtest_vs_last_model |
|---|---|---|---|---|---|
| Model with untransformed y and imputed untransformed predictors | 0.6830835 | 0.6814156 | 672897.5 | 723299.7 | NA |
| Model with imputed untransformed predictors and quadratic terms | 0.7119140 | 0.7102841 | 641560.4 | 697334.7 | 0 |
Two-way interaction terms of the predictors are added into the model, and whether a certain interaction term should be included in the model is determined by the inclusion criteria listed at the beginning of this part. The model with quadratic and two-way interaction terms is presented below.
# Add interactions
lm12 <- lm(last_sold_price ~
year_built + I(year_built^2) +
bathrooms + I(bathrooms^2) +
bedrooms + I(bedrooms^2) +
last_sold_date + I(as.numeric(last_sold_date)^2) +
home_size +
home_type +
tax_year +
tax_value + I(tax_value^2) +
as.character(zipcode) +
location_cluster +
year_built:last_sold_date + year_built:tax_value +
bathrooms:last_sold_date + bathrooms:home_size +
bedrooms:last_sold_date + bedrooms:home_size + bedrooms:tax_value +
last_sold_date:home_size + last_sold_date:tax_year +
home_size:tax_value +
tax_year:tax_value +
home_type:year_built + home_type:bathrooms + home_type:bedrooms +
home_type:last_sold_date + home_type:tax_value +
location_cluster:bathrooms + location_cluster:bedrooms +
location_cluster:last_sold_date + location_cluster:home_size +
location_cluster:tax_value,
data = sf_data_impute_train)
model <- "Model with imputed untransformed predictors, quadratic and interaction terms"
R_square <- summary(lm12)$r.squared
adj_R_square <- summary(lm12)$adj.r.squared
RMSE_train <- RMSE(sf_data_impute_train$last_sold_price, predict(lm12, sf_data_impute_train))
## Warning in predict.lm(lm12, sf_data_impute_train): prediction from a rank-
## deficient fit may be misleading
RMSE_test <- RMSE(sf_data_impute_test$last_sold_price, predict(lm12, sf_data_impute_test))
## Warning in predict.lm(lm12, sf_data_impute_test): prediction from a rank-
## deficient fit may be misleading
p_lrtest <- lrtest(lm11, lm12)$`Pr(>Chisq)`[2]
Model_results_imputed_sf <- bind_rows(Model_results_imputed_sf, data.frame(Model = model, R_square = R_square, adjusted_R_square = adj_R_square, RMSE_train = RMSE_train, RMSE_test = RMSE_test, p_lrtest_vs_last_model = p_lrtest))
Model_results_imputed_sf %>% kable
| Model | R_square | adjusted_R_square | RMSE_train | RMSE_test | p_lrtest_vs_last_model |
|---|---|---|---|---|---|
| Model with untransformed y and imputed untransformed predictors | 0.6830835 | 0.6814156 | 672897.5 | 723299.7 | NA |
| Model with imputed untransformed predictors and quadratic terms | 0.7119140 | 0.7102841 | 641560.4 | 697334.7 | 0 |
| Model with imputed untransformed predictors, quadratic and interaction terms | 0.7878481 | 0.7835523 | 550554.0 | 712090.5 | 0 |
Higher order terms of the predictors are added into the model, and whether a certain higher order term should be included in the model is determined by the inclusion criteria listed at the beginning of this part. The model with quadratic, two-way interaction and higher order terms is presented below.
# Add higher order terms
lm13 <- lm(last_sold_price ~
year_built + I(year_built^2) +
bathrooms + I(bathrooms^2) +
bedrooms + I(bedrooms^2) +
last_sold_date + I(as.numeric(last_sold_date)^2) + I(as.numeric(last_sold_date)^3) +
home_size +
home_type +
tax_year +
tax_value + I(tax_value^2) +
as.character(zipcode) +
location_cluster +
year_built:last_sold_date + year_built:tax_value +
bathrooms:last_sold_date + bathrooms:home_size +
bedrooms:last_sold_date + bedrooms:home_size + bedrooms:tax_value +
last_sold_date:home_size + last_sold_date:tax_year +
home_size:tax_value +
tax_year:tax_value +
home_type:year_built + home_type:bathrooms + home_type:bedrooms +
home_type:last_sold_date + home_type:tax_value +
location_cluster:bathrooms + location_cluster:bedrooms +
location_cluster:last_sold_date + location_cluster:home_size +
location_cluster:tax_value,
data = sf_data_impute_train)
model <- "Model with imputed untransformed predictors, quadratic, interaction and higher order terms"
R_square <- summary(lm13)$r.squared
adj_R_square <- summary(lm13)$adj.r.squared
RMSE_train <- RMSE(sf_data_impute_train$last_sold_price, predict(lm13, sf_data_impute_train))
## Warning in predict.lm(lm13, sf_data_impute_train): prediction from a rank-
## deficient fit may be misleading
RMSE_test <- RMSE(sf_data_impute_test$last_sold_price, predict(lm13, sf_data_impute_test))
## Warning in predict.lm(lm13, sf_data_impute_test): prediction from a rank-
## deficient fit may be misleading
p_lrtest <- lrtest(lm12, lm13)$`Pr(>Chisq)`[2]
Model_results_imputed_sf <- bind_rows(Model_results_imputed_sf, data.frame(Model = model, R_square = R_square, adjusted_R_square = adj_R_square, RMSE_train = RMSE_train, RMSE_test = RMSE_test, p_lrtest_vs_last_model = p_lrtest))
Model_results_imputed_sf %>% kable
| Model | R_square | adjusted_R_square | RMSE_train | RMSE_test | p_lrtest_vs_last_model |
|---|---|---|---|---|---|
| Model with untransformed y and imputed untransformed predictors | 0.6830835 | 0.6814156 | 672897.5 | 723299.7 | NA |
| Model with imputed untransformed predictors and quadratic terms | 0.7119140 | 0.7102841 | 641560.4 | 697334.7 | 0 |
| Model with imputed untransformed predictors, quadratic and interaction terms | 0.7878481 | 0.7835523 | 550554.0 | 712090.5 | 0 |
| Model with imputed untransformed predictors, quadratic, interaction and higher order terms | 0.7889528 | 0.7846622 | 549118.7 | 700831.2 | 0 |
According to the results above, the linear regression model on the imputed data set doesn’t fit as well as the model based on the complete data set (adjusted \(R^2\) = 0.78 < 0.80), so we won’t explore this model further.
One possible reason is that the imputed values shrink the variance of the predictors, so the predictors can’t explain as much of the variance of the outcome as before; another is that the imputation method introduces bias into the data, which also degrades the performance of the model.
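The variance-shrinkage point can be illustrated with a small simulation; mean imputation stands in here for whichever imputation method is actually used, but the shrinkage effect is qualitatively the same.

```r
# Simulated illustration: filling missing values shrinks predictor variance.
set.seed(1)
x <- rnorm(1000, mean = 1500, sd = 400)   # a made-up predictor, e.g. home_size
x_obs <- x
x_obs[sample(1000, 300)] <- NA            # 30% missing completely at random
x_imp <- ifelse(is.na(x_obs), mean(x_obs, na.rm = TRUE), x_obs)
var(x)      # variance of the fully observed predictor
var(x_imp)  # noticeably smaller: imputed values add no spread around the mean
```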
We then do the same thing for New York with the imputed data set.
set.seed(1)
inTrain_ny <- createDataPartition(y=ny_data_impute$last_sold_price, p=0.9)
ny_data_impute_train <- slice(ny_data_impute, inTrain_ny$Resample1)
ny_data_impute_test <- slice(ny_data_impute, -inTrain_ny$Resample1)
We fit the model with untransformed outcome and predictors and the result is presented below.
lm14<- lm(last_sold_price ~
year_built +
bathrooms +
bedrooms +
last_sold_date +
home_size +
home_type +
tax_year +
tax_value +
as.character(zipcode) +
location_cluster,
data = ny_data_impute_train)
model <- "Model with untransformed y and imputed untransformed predictors"
R_square <- summary(lm14)$r.squared
adj_R_square <- summary(lm14)$adj.r.squared
RMSE_train <- RMSE(ny_data_impute_train$last_sold_price, predict(lm14, ny_data_impute_train))
## Warning in predict.lm(lm14, ny_data_impute_train): prediction from a rank-
## deficient fit may be misleading
RMSE_test <- RMSE(ny_data_impute_test$last_sold_price, predict(lm14, ny_data_impute_test))
## Warning in predict.lm(lm14, ny_data_impute_test): prediction from a rank-
## deficient fit may be misleading
Model_results_imputed_ny <- data.frame(Model = model, R_square = R_square, adjusted_R_square = adj_R_square, RMSE_train = RMSE_train, RMSE_test = RMSE_test, p_lrtest_vs_last_model = NA)
Model_results_imputed_ny %>% kable
| Model | R_square | adjusted_R_square | RMSE_train | RMSE_test | p_lrtest_vs_last_model |
|---|---|---|---|---|---|
| Model with untransformed y and imputed untransformed predictors | 0.2303367 | 0.2063354 | 1443703 | 1015142 | NA |
We then add quadratic terms of the predictors to the model, and whether a certain quadratic term should be included in the model is determined by the inclusion criteria listed at the beginning of this part. The model with included quadratic terms is presented below.
# Add quadratic terms
lm15 <- lm(last_sold_price ~
year_built + I(year_built^2) +
bathrooms + I(bathrooms^2) +
bedrooms + I(bedrooms^2) +
last_sold_date +
home_size + I(home_size^2) +
home_type +
tax_year + I(tax_year^2) +
tax_value + I(tax_value^2) +
as.character(zipcode) +
location_cluster,
data = ny_data_impute_train)
model <- "Model with imputed untransformed predictors and quadratic terms"
R_square <- summary(lm15)$r.squared
adj_R_square <- summary(lm15)$adj.r.squared
RMSE_train <- RMSE(ny_data_impute_train$last_sold_price, predict(lm15, ny_data_impute_train))
## Warning in predict.lm(lm15, ny_data_impute_train): prediction from a rank-
## deficient fit may be misleading
RMSE_test <- RMSE(ny_data_impute_test$last_sold_price, predict(lm15, ny_data_impute_test))
## Warning in predict.lm(lm15, ny_data_impute_test): prediction from a rank-
## deficient fit may be misleading
p_lrtest <- lrtest(lm14, lm15)$`Pr(>Chisq)`[2]
Model_results_imputed_ny <- bind_rows(Model_results_imputed_ny, data.frame(Model = model, R_square = R_square, adjusted_R_square = adj_R_square, RMSE_train = RMSE_train, RMSE_test = RMSE_test, p_lrtest_vs_last_model = p_lrtest))
## Warning in rbind_all(x, .id): Unequal factor levels: coercing to character
Two-way interaction terms of the predictors are added into the model, and whether a certain interaction term should be included in the model is determined by the inclusion criteria listed at the beginning of this part. The model with quadratic and two-way interaction terms is presented below.
# Add interaction terms
lm16 <- lm(last_sold_price ~
year_built + I(year_built^2) +
bathrooms + I(bathrooms^2) +
bedrooms + I(bedrooms^2) +
last_sold_date +
home_size + I(home_size^2) +
home_type +
tax_year + I(tax_year^2) +
tax_value + I(tax_value^2) +
as.character(zipcode) +
location_cluster +
year_built:tax_value +
bathrooms:bedrooms +
home_size:tax_value +
home_type:year_built + home_type:bedrooms + home_type:home_size +
home_type:tax_value +
as.character(zipcode):year_built + as.character(zipcode):bathrooms +
as.character(zipcode):home_size,
data = ny_data_impute_train)
model <- "Model with imputed untransformed predictors, quadratic and interaction terms"
R_square <- summary(lm16)$r.squared
adj_R_square <- summary(lm16)$adj.r.squared
RMSE_train <- RMSE(ny_data_impute_train$last_sold_price, predict(lm16, ny_data_impute_train))
## Warning in predict.lm(lm16, ny_data_impute_train): prediction from a rank-
## deficient fit may be misleading
RMSE_test <- RMSE(ny_data_impute_test$last_sold_price, predict(lm16, ny_data_impute_test))
## Warning in predict.lm(lm16, ny_data_impute_test): prediction from a rank-
## deficient fit may be misleading
p_lrtest <- lrtest(lm15, lm16)$`Pr(>Chisq)`[2]
Model_results_imputed_ny <- bind_rows(Model_results_imputed_ny, data.frame(Model = model, R_square = R_square, adjusted_R_square = adj_R_square, RMSE_train = RMSE_train, RMSE_test = RMSE_test, p_lrtest_vs_last_model = p_lrtest))
Higher order terms of the predictors are added into the model, and whether a certain higher order term should be included in the model is determined by the inclusion criteria listed at the beginning of this part. The model with quadratic, two-way interaction and higher order terms is presented below.
# Add Higher Order Terms
lm17 <- lm(last_sold_price ~
year_built + I(year_built^2) +
bathrooms + I(bathrooms^2) +
bedrooms + I(bedrooms^2) +
last_sold_date +
home_size + I(home_size^2) +
home_type +
tax_year + I(tax_year^2) +
tax_value + I(tax_value^2) + I(tax_value^3) +
as.character(zipcode) +
location_cluster +
year_built:tax_value +
bathrooms:bedrooms +
home_size:tax_value +
home_type:year_built + home_type:bedrooms + home_type:home_size +
home_type:tax_value +
as.character(zipcode):year_built + as.character(zipcode):bathrooms +
as.character(zipcode):home_size,
data = ny_data_impute_train)
model <- "Model with imputed untransformed predictors, quadratic, interaction and higher order terms"
R_square <- summary(lm17)$r.squared
adj_R_square <- summary(lm17)$adj.r.squared
RMSE_train <- RMSE(ny_data_impute_train$last_sold_price, predict(lm17, ny_data_impute_train))
## Warning in predict.lm(lm17, ny_data_impute_train): prediction from a rank-
## deficient fit may be misleading
RMSE_test <- RMSE(ny_data_impute_test$last_sold_price, predict(lm17, ny_data_impute_test))
## Warning in predict.lm(lm17, ny_data_impute_test): prediction from a rank-
## deficient fit may be misleading
p_lrtest <- lrtest(lm16, lm17)$`Pr(>Chisq)`[2]
Model_results_imputed_ny <- bind_rows(Model_results_imputed_ny, data.frame(Model = model, R_square = R_square, adjusted_R_square = adj_R_square, RMSE_train = RMSE_train, RMSE_test = RMSE_test, p_lrtest_vs_last_model = p_lrtest))
Model_results_imputed_ny %>% kable
| Model | R_square | adjusted_R_square | RMSE_train | RMSE_test | p_lrtest_vs_last_model |
|---|---|---|---|---|---|
| Model with untransformed y and imputed untransformed predictors | 0.2303367 | 0.2063354 | 1443703.4 | 1015141.9 | NA |
| Model with imputed untransformed predictors and quadratic terms | 0.2571315 | 0.2333196 | 1418350.5 | 977254.5 | 0 |
| Model with imputed untransformed predictors, quadratic and interaction terms | 0.6291848 | 0.5831948 | 1002089.1 | 735410.5 | 0 |
| Model with imputed untransformed predictors, quadratic, interaction and higher order terms | 0.6307531 | 0.5848941 | 999967.8 | 729857.1 | 0 |
According to the results above, this linear regression model on the imputed data set fits better than the model based on the complete data set (adjusted \(R^2\) = 0.58 > 0.53).
Then we explore the variables of importance in the above model.
varimp_ny <- varImp(lm17, scale = T)
varimp_ny <- data.frame(variable = rownames(varimp_ny), varimp_ny)
rownames(varimp_ny) <- 1:length(varimp_ny$variable)
varimp_ny <- varimp_ny %>%
arrange(desc(Overall)) %>%
mutate(Overall = Overall / sum(varimp_ny$Overall) * 100) %>%
slice(1:20)
varimp_ny %>%
ggplot(aes(x=1:dim(.)[1], y = Overall)) +
theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust = 1),
axis.title.x = element_blank()) +
geom_line(color = "darkgrey") +
geom_point(color = "grey") +
scale_x_continuous(breaks=1:dim(varimp_ny)[1],
labels=varimp_ny$variable) +
labs(title = "Variable of Importance in New York",
y = "Variance explained(%)")
From the above plot, we can see that tax_year, tax_value, and the interactions between some of the zipcode levels and home_size or bathrooms are the variables that explain the most variance of the outcome in this model.
Since the \(R^2\) values of these linear regression models are not very high, producing relatively large RMSEs, we next explore different models to see whether we can get better prediction results.
We use the same train and test sets as for the linear regression models as input to the Random Forest.
library(randomForest)
## randomForest 4.6-12
## Type rfNews() to see new features/changes/bug fixes.
##
## Attaching package: 'randomForest'
## The following object is masked from 'package:gridExtra':
##
## combine
## The following object is masked from 'package:ggplot2':
##
## margin
## The following object is masked from 'package:dplyr':
##
## combine
sf_data_impute_rf_train <- sf_data_impute_train %>%
mutate(last_sold_date = as.numeric(last_sold_date),
home_type = as.factor(home_type),
zipcode = as.factor(zipcode))
sf_data_impute_rf_test <- sf_data_impute_test %>%
mutate(last_sold_date = as.numeric(last_sold_date),
home_type = as.factor(home_type),
zipcode = as.factor(zipcode))
rf1 <- randomForest(last_sold_price ~ year_built +
bathrooms +
bedrooms +
last_sold_date +
home_size +
home_type +
tax_year +
tax_value +
zipcode,
data = sf_data_impute_rf_train, ntree=500, nodesize = 50, mtry=1)
RMSE_train <- RMSE(predict(rf1, sf_data_impute_rf_train), sf_data_impute_rf_train$last_sold_price)
# Align the test-set factor levels with those seen in training (last_sold_date
# is numeric after the mutate above, so it needs no level alignment)
sf_data_impute_rf_test$zipcode <- factor(sf_data_impute_rf_test$zipcode, levels = levels(sf_data_impute_rf_train$zipcode))
sf_data_impute_rf_test$home_type <- factor(sf_data_impute_rf_test$home_type, levels = levels(sf_data_impute_rf_train$home_type))
RMSE_test <- RMSE(predict(rf1, sf_data_impute_rf_test[, 1:9]), sf_data_impute_rf_test$last_sold_price)
Model_results_sf <- bind_rows(Model_results_sf, data.frame(Model = "Random Forest", R_square = NA, adjusted_R_square = NA, RMSE_train = RMSE_train, RMSE_test = RMSE_test, p_lrtest_vs_last_model = NA))
Model_results_sf %>% kable
| Model | R_square | adjusted_R_square | RMSE_train | RMSE_test | p_lrtest_vs_last_model |
|---|---|---|---|---|---|
| Model with log transformed y and untransformed predictors | 0.6157989 | 0.6137994 | 1757179.7 | 1632475.3 | NA |
| Model with centered years | 0.7157573 | 0.7142780 | 632535.3 | 527563.0 | NA |
| Model with untransformed y and untransformed predictors | 0.7157573 | 0.7142780 | 632535.3 | 527583.3 | NA |
| Model with quadratic terms contributing to the prediction results | 0.7298357 | 0.7283525 | 616671.9 | 509283.3 | 0.0000000 |
| Model with quadratic and interaction terms contributing to the prediction results | 0.8023154 | 0.7980183 | 527505.0 | 497166.2 | 0.0000000 |
| Model with quadratic, interaction and higher order terms contributing to the prediction results | 0.8024080 | 0.7980935 | 527381.4 | 497195.4 | 0.0256473 |
| Random Forest | NA | NA | 621467.2 | 802300.7 | NA |
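The forest's own relative variable importance can also be inspected directly (a brief sketch using `randomForest::importance()` on the `rf1` fit above; the exact column names depend on the importance type computed):

```r
# Per-variable increase in node purity (IncNodePurity);
# higher values indicate variables that yield more useful splits.
imp <- randomForest::importance(rf1)
imp[order(imp[, 1], decreasing = TRUE), , drop = FALSE]
```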
We see that the random forest, fit on the same untransformed outcome and predictors, does not give us a smaller test RMSE than the linear models. We therefore turn to a neural network to see whether it yields better predictions.
We feed the same train and test sets as before into the neural network so that model performance can be compared directly. We tried many different parameter combinations; because the runtime of the neural network is very long, here we present only the model that gave the best prediction results (smallest RMSE on the train and test sets).
The parameter space we searched for the neural network is listed below; the parameters were tuned over all combinations of these values.

- hidden: 2, 3, 4, c(5, 3), c(6, 3)
- stepmax: 1e5, 2e5

We also ran this model on both the complete data set and the imputed data set, and found that the complete data set produced better results.
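The tuning described above can be sketched as a simple grid search (illustrative only, since a full run is slow; the fitting loop is left commented out for that reason):

```r
# Candidate architectures and iteration limits tried for neuralnet
hidden_grid  <- list(2, 3, 4, c(5, 3), c(6, 3))
stepmax_grid <- c(1e5, 2e5)

grid <- expand.grid(h = seq_along(hidden_grid), s = stepmax_grid)
# For each combination, fit on the train set and record test RMSE,
# keeping the configuration with the smallest error:
# for (i in seq_len(nrow(grid))) {
#   fit <- neuralnet(f, data = train,
#                    hidden  = hidden_grid[[grid$h[i]]],
#                    stepmax = grid$s[i], linear.output = TRUE)
#   ...
# }
```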
library(neuralnet)
## Loading required package: MASS
##
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
##
## select
##
## Attaching package: 'neuralnet'
## The following object is masked from 'package:dplyr':
##
## compute
library(tidyr)
#sf_neural <- sf_data
sf_neural <- read.csv("data/sf_neural.csv")
#colnames(sf_data)
sf_neural <- sf_neural %>% dplyr::mutate(home_type_cate = as.numeric(as.factor(sf_neural$home_type)))
sf_neural <- sf_neural %>% dplyr::select(-list, -url, -property_size, -log_last_sold_price, -log_tax_value, -log_home_size, -log_property_size, -home_type, -zillow_id)
sf_neural$last_sold_date<- as.numeric(sf_neural$last_sold_date)
sf_neural$location_cluster<-as.numeric(sf_neural$location_cluster)
sf_neural$zipcode <-as.numeric(as.factor(sf_neural$zipcode))
sf_neural <- sf_neural %>% dplyr::select(c(zipcode,year_built,bathrooms,bedrooms,last_sold_date,tax_year,tax_value,home_size,home_type_cate,last_sold_price))
set.seed(1)
index <- sample(1:nrow(sf_neural),nrow(sf_neural)*0.8)
maxs <- apply(sf_neural,2,max)
mins <- apply(sf_neural,2,min)
scaled <- as.data.frame(scale(as.matrix(sf_neural), center=mins, scale=maxs-mins))
train<-scaled[index,]
test<- scaled[-index,]
n <- names(train)
f<-as.formula(paste("last_sold_price ~", paste(n[!n%in% "last_sold_price"],collapse = "+")))
nn_sf <- neuralnet(f, data=train,hidden=c(4), stepmax = 2e5, linear.output = T)
#nn_sf <- neuralnet(last_sold_price~zipcode+year_built+bathrooms+bedrooms+last_sold_date+tax_value+home_size, data=train,hidden=c(5,3), linear.output = T)
# This plot could not be rendered in the knitted html directly,
# so we include it in the html manually
plot(nn_sf)
pr.nn <- compute(nn_sf,test[,1:9])
pr.nn <- pr.nn$net.result*(max(sf_neural$last_sold_price) - min(sf_neural$last_sold_price))+min(sf_neural$last_sold_price)
test.r <- (test$last_sold_price) * (max(sf_neural$last_sold_price)-min(sf_neural$last_sold_price))+min(sf_neural$last_sold_price)
# TEST
MSE_test <- (RMSE(test.r, pr.nn))^2
RMSE_test_nn <- sqrt(sum((test.r - pr.nn)^2)/nrow(test))
prt.nn <- compute(nn_sf,train[,1:9])
prt.nn <- prt.nn$net.result*(max(sf_neural$last_sold_price) - min(sf_neural$last_sold_price))+min(sf_neural$last_sold_price)
train.r <- (train$last_sold_price) * (max(sf_neural$last_sold_price)-min(sf_neural$last_sold_price))+min(sf_neural$last_sold_price)
# TRAIN
MSE_train <- (RMSE(train.r, prt.nn))^2
RMSE_train_nn <- sqrt(sum((train.r - prt.nn)^2)/nrow(train))
We then compare the prediction and the actual observation in the train and test set.
plot(test.r, main="Comparison between predicted value (in Green) \n and actual value (in Black) in test set for SF",
xlab="property index", ylab="property price")
lines(pr.nn,col='green')
plot(train.r, main="Comparison between predicted value (in Red) \n and actual value (in Black) in train set for SF",
xlab="property index", ylab="property price")
lines(prt.nn,col=2)
#sf_original_data <- sf_data
sf_original_data <- read.csv("data/sf_neural.csv") %>% dplyr::select(c(zillow_id,latitude,longitude,year_built,bathrooms,bedrooms,
last_sold_date,last_sold_price,home_type,tax_year,
tax_value,home_size,zipcode,list,url,property_size))
sf_pred_test <- sf_original_data %>% slice(as.numeric(row.names(pr.nn))) %>% mutate(prediction = pr.nn)
sf_pred_train <- sf_original_data %>% slice(as.numeric(row.names(prt.nn))) %>% mutate(prediction = prt.nn)
sf_pred <-rbind(sf_pred_test,sf_pred_train)
#write.csv(sf_pred, "sf_prediction.csv",row.names = FALSE)
From the above figures, we do see that neural network produces good prediction results. We further compare the RMSE produced by this model and those produced by the previous models.
Model_results_sf <- dplyr::bind_rows(Model_results_sf, Model_results_imputed_sf, data.frame(Model = "Neural Network", R_square = NA, adjusted_R_square = NA, RMSE_train = RMSE_train_nn, RMSE_test = RMSE_test_nn, p_lrtest_vs_last_model = NA))
Model_results_sf %>% kable
| Model | R_square | adjusted_R_square | RMSE_train | RMSE_test | p_lrtest_vs_last_model |
|---|---|---|---|---|---|
| Model with log transformed y and untransformed predictors | 0.6157989073 | 0.6137993747 | 1757179.685387 | 1632475.310965 | NA |
| Model with centered years | 0.7157573483 | 0.7142780385 | 632535.316264 | 527562.999747 | NA |
| Model with untransformed y and untransformed predictors | 0.7157573483 | 0.7142780385 | 632535.316264 | 527583.280399 | NA |
| Model with quadratic terms contributing to the prediction results | 0.7298357019 | 0.7283525473 | 616671.865284 | 509283.297487 | 0.0000000000 |
| Model with quadratic and interaction terms contributing to the prediction results | 0.8023153959 | 0.7980183179 | 527504.998286 | 497166.225614 | 0.0000000000 |
| Model with quadratic, interaction and higher order terms contributing to the prediction results | 0.8024080322 | 0.7980935481 | 527381.387736 | 497195.408575 | 0.0256473269 |
| Random Forest | NA | NA | 621467.173121 | 802300.715977 | NA |
| Model with untransformed y and imputed untransformed predictors | 0.6830834702 | 0.6814156195 | 672897.547488 | 723299.684783 | NA |
| Model with imputed untransformed predictors and quadratic terms | 0.7119140340 | 0.7102841275 | 641560.401798 | 697334.667197 | 0.0000000000 |
| Model with imputed untransformed predictors, quadratic and interaction terms | 0.7878481401 | 0.7835523355 | 550553.998681 | 712090.522619 | 0.0000000000 |
| Model with imputed untransformed predictors, quadratic, interaction and higher order terms | 0.7889528334 | 0.7846622309 | 549118.736453 | 700831.168914 | 0.0000000000 |
| Neural Network | 1.0000000000 | 1.0000000000 | 2250.432175 | 6579.316389 | 1.0000000000 |
Model_results_sf <- Model_results_sf %>% gather(train_test, RMSE, `RMSE_train`:`RMSE_test`)
ggplot(Model_results_sf, aes(x=Model, y= RMSE, group = train_test, color = train_test)) +
geom_point()+
geom_line()+
xlab("Model") +
ylab("RMSE") +ggtitle("Comparing RMSE for different models")+ theme(axis.text.x = element_text(angle = 20, vjust = 1, hjust = 1))
Then we do the same thing for New York.
#ny_neural <- ny_data_impute
ny_neural<- read.csv("data/ny_neural.csv") %>% filter(home_size>20 & home_size < 1e5)
#ny_neural<- ny_data
ny_neural <- ny_neural %>% dplyr::mutate(home_type_cate = as.numeric(as.factor(ny_neural$home_type)))
ny_neural <- ny_neural %>% dplyr::select(-list, -url, -property_size, -log_last_sold_price, -log_tax_value, -log_home_size, -log_property_size, -home_type, -zillow_id)
ny_neural$last_sold_date<- as.numeric(ny_neural$last_sold_date)
ny_neural$location_cluster<-as.numeric(ny_neural$location_cluster)
ny_neural$zipcode <-as.numeric(as.factor(ny_neural$zipcode))
ny_neural <- ny_neural %>% dplyr::select(c(location_cluster,year_built,bathrooms,bedrooms,last_sold_date,tax_year,tax_value,home_size,home_type_cate,last_sold_price))
set.seed(1)
index <- sample(1:nrow(ny_neural),nrow(ny_neural)*0.8)
maxs <- apply(ny_neural,2,max)
mins <- apply(ny_neural,2,min)
scaled <- as.data.frame(scale(as.matrix(ny_neural), center=mins, scale=maxs-mins))
train<-scaled[index,]
test<- scaled[-index,]
n <- names(train)
f<-as.formula(paste("last_sold_price ~", paste(n[!n%in% "last_sold_price"],collapse = "+")))
nn_ny <- neuralnet(f, data=train, hidden=c(3), linear.output = TRUE)
#nn_ny <- neuralnet(last_sold_price~zipcode+year_built+bathrooms+bedrooms+last_sold_date+tax_value+home_size, data=train,hidden=c(5,3), linear.output = T)
# This plot could not be rendered in the knitted html directly,
# so we include it in the html manually
plot(nn_ny)
pr.nn <- compute(nn_ny,test[,1:9])
pr.nn <- pr.nn$net.result*(max(ny_neural$last_sold_price) - min(ny_neural$last_sold_price))+min(ny_neural$last_sold_price)
test.r <- (test$last_sold_price) * (max(ny_neural$last_sold_price)-min(ny_neural$last_sold_price))+min(ny_neural$last_sold_price)
MSE_test <- (RMSE(test.r, pr.nn))^2
RMSE_test_nn <- sqrt(sum((test.r - pr.nn)^2)/nrow(test))
prt.nn <- compute(nn_ny,train[,1:9])
prt.nn <- prt.nn$net.result*(max(ny_neural$last_sold_price) - min(ny_neural$last_sold_price))+min(ny_neural$last_sold_price)
train.r <- (train$last_sold_price) * (max(ny_neural$last_sold_price)-min(ny_neural$last_sold_price))+min(ny_neural$last_sold_price)
MSE_train <- (RMSE(train.r, prt.nn))^2
RMSE_train_nn <- sqrt(sum((train.r - prt.nn)^2)/nrow(train))
plot(test.r, main="Comparison between predicted value (in Green) \n and actual value (in Black) in test set for NY",
xlab="property index", ylab="property price")
lines(pr.nn,col='green')
plot(train.r, main="Comparison between predicted value (in Red) \n and actual value (in Black) in train set for NY",
xlab="property index", ylab="property price")
lines(prt.nn,col=2)
#ny_original_data <- ny_data
ny_original_data <- read.csv("data/ny_neural.csv") %>% dplyr::select(c(zillow_id,latitude,longitude,year_built,bathrooms,bedrooms,
last_sold_date,last_sold_price,home_type,tax_year,
tax_value,home_size,zipcode,list,url,property_size))
ny_pred_test <- ny_original_data %>% slice(as.numeric(row.names(pr.nn))) %>% mutate(prediction = pr.nn)
ny_pred_train <- ny_original_data %>% slice(as.numeric(row.names(prt.nn))) %>% mutate(prediction = prt.nn)
ny_pred <-rbind(ny_pred_test,ny_pred_train)
#write.csv(ny_pred, "ny_prediction.csv",row.names = FALSE)
As in San Francisco, the above figures show that the neural network produces good prediction results for New York. We again compare the RMSE of this model with those of the previous models.
Model_results_ny <- bind_rows(Model_results_ny, Model_results_imputed_ny, data.frame(Model = "Neural Network", R_square = NA, adjusted_R_square = NA, RMSE_train = RMSE_train_nn, RMSE_test = RMSE_test_nn, p_lrtest_vs_last_model = NA))
Model_results_ny %>% kable
| Model | R_square | adjusted_R_square | RMSE_train | RMSE_test | p_lrtest_vs_last_model |
|---|---|---|---|---|---|
| Model with untransformed y and untransformed predictors | 0.3276205665 | 0.2916757540 | 1316364.199647 | 967284.22112 | NA |
| Model with quadratic terms contributing to the prediction results | 0.3495884300 | 0.3144587018 | 1294681.615819 | 963388.50693 | 0.0000000000 |
| Model with quadratic and interaction terms contributing to the prediction results | 0.5973370601 | 0.5305379877 | 1018684.462390 | 863657.09558 | 0.0000000000 |
| Model with untransformed y and imputed untransformed predictors | 0.2303366718 | 0.2063353712 | 1443703.398577 | 1015141.86157 | NA |
| Model with imputed untransformed predictors and quadratic terms | 0.2571314913 | 0.2333195947 | 1418350.466376 | 977254.52823 | 0.0000000000 |
| Model with imputed untransformed predictors, quadratic and interaction terms | 0.6291847773 | 0.5831948324 | 1002089.147511 | 735410.51332 | 0.0000000000 |
| Model with imputed untransformed predictors, quadratic, interaction and higher order terms | 0.6307530716 | 0.5848940733 | 999967.827049 | 729857.14822 | 0.0000000243 |
| Neural Network | 1.0000000000 | 1.0000000000 | 7830.418536 | 69171.42403 | 1.0000000000 |
Model_results_ny <- Model_results_ny %>% gather(train_test, RMSE, `RMSE_train`:`RMSE_test`)
ggplot(Model_results_ny, aes(x=Model, y= RMSE, group = train_test, color = train_test)) +
geom_point()+
geom_line()+
xlab("Model") +
ylab("RMSE") +ggtitle("Comparing RMSE for different models")+ theme(axis.text.x = element_text(angle = 10, vjust = 1, hjust = 1))
We compared the predictions with the actual last sold prices in the data set to assess the performance of our model. Because neural network predictions vary from run to run (random weight initialization), we saved one set of prediction results to a csv file and read it back here for evaluation.
# Comparing predicted price and last sold price for all the properties in SF
sf_perform <-read.csv("data/sf_prediction.csv")
predict_sf_plot <- ggplot(sf_perform, aes(x = row.names(sf_perform), y = prediction)) +
  geom_point(alpha = 0.4, color = "red") +
  xlab("properties") + ylab("predicted price") +
  theme(axis.text.x = element_blank()) +
  ggtitle("Predicted price for all the properties in SF")
sold_sf_plot <- ggplot(sf_perform, aes(x = row.names(sf_perform), y = last_sold_price)) +
  geom_point(alpha = 0.4, color = "blue") +
  xlab("properties") + ylab("last sold price") +
  theme(axis.text.x = element_blank()) +
  ggtitle("Last sold price for all the properties in SF")
grid.arrange(predict_sf_plot, sold_sf_plot)
# Comparing predicted price and last sold price for all the properties in NYC
ny_perform <-read.csv("data/ny_prediction.csv")
predict_ny_plot <- ggplot(ny_perform, aes(x = row.names(ny_perform), y = prediction)) +
  geom_point(alpha = 0.4, color = "red") +
  xlab("properties") + ylab("predicted price") +
  theme(axis.text.x = element_blank()) +
  ggtitle("Predicted price for all the properties in NYC")
sold_ny_plot <- ggplot(ny_perform, aes(x = row.names(ny_perform), y = last_sold_price)) +
  geom_point(alpha = 0.4, color = "blue") +
  xlab("properties") + ylab("last sold price") +
  theme(axis.text.x = element_blank()) +
  ggtitle("Last sold price for all the properties in NYC")
grid.arrange(predict_ny_plot, sold_ny_plot)
We see that in both cities the predictions are very close to the actual prices. In San Francisco the neural network captures the major trends in the data, and we believe a more complex model could improve the predictions further.
We then compare the prediction of our model to the current listing of the properties to see whether our model could provide insights into current listing of the properties and help us to decide whether a certain property is worth buying.
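One simple way to act on such a comparison is to flag listings whose asking price sits well below the model's estimate (a hypothetical sketch; it assumes a data frame with numeric `listing` and `prediction` columns like the ones constructed below, and the 1.1 threshold is an arbitrary illustration):

```r
# Ratio > 1 means the model values the property above its asking price,
# suggesting the listing may be relatively underpriced.
flag_deals <- function(df, threshold = 1.1) {
  df$value_ratio <- df$prediction / df$listing
  df[df$value_ratio > threshold, c("listing", "prediction", "value_ratio")]
}
```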
library(ggmap)
library(gridExtra)
#read in the prediction result by previous model and
#we are interested in only current listing properties
sf_prediction<- read.csv("data/sf_prediction.csv") %>% filter(!is.na(list)) %>% filter(prediction!=0)
#convert the listing price to numeric for comparison
listing <- gsub("\\$","",sf_prediction$list)
listing <- gsub(",","",listing)
listing <- as.numeric(listing)
## Warning: NAs introduced by coercion
sf_prediction <- sf_prediction %>% mutate(listing = listing) %>% filter(!is.na(listing))
#plot the current listing price on map, use color to demonstrate the price, and use size to show the
#home size of each property
sf_listing_price <- qmplot(longitude, latitude, data = sf_prediction,
                           colour = listing, size = home_size) +
  scale_colour_gradient(low = "red", high = "green",
                        guide = guide_legend(title = "listing price range")) +
  ggtitle("San Francisco current listing property listing price") +
  theme(plot.title = element_text(family = "Arial", face = "bold", size = 10))
#do the same thing for predicted price
sf_predict_price <- qmplot(longitude, latitude, data = sf_prediction,
                           colour = prediction, size = home_size) +
  scale_colour_gradient(low = "red", high = "green",
                        guide = guide_legend(title = "predicted price range")) +
  ggtitle("San Francisco current listing property predicted price") +
  theme(plot.title = element_text(family = "Arial", face = "bold", size = 10))
sf_listing_price
sf_predict_price
#do the same thing for NY
ny_prediction<- read.csv("data/ny_prediction.csv") %>% filter(!is.na(list)) %>% filter(prediction!=0)
listing <- gsub("\\$","",ny_prediction$list)
listing <- gsub(",","",listing)
listing <- as.numeric(listing)
## Warning: NAs introduced by coercion
ny_prediction <- ny_prediction %>% mutate(listing = listing) %>% filter(!is.na(listing))
mid <- (max(ny_prediction$listing)-min(ny_prediction$listing))*(1/3)
ny_listing_price <- qmplot(longitude, latitude, data = ny_prediction,
                           colour = listing, size = home_size) +
  scale_color_gradient2(midpoint = mid, low = "yellow", mid = "blue", high = "red",
                        space = "Lab",
                        guide = guide_legend(title = "listing price range")) +
  ggtitle("New York City current listing property listing price") +
  theme(plot.title = element_text(family = "Arial", face = "bold", size = 10))
mid <- (max(ny_prediction$prediction)-min(ny_prediction$prediction))*(1/3)
ny_predict_price <- qmplot(longitude, latitude, data = ny_prediction,
                           colour = prediction, size = home_size) +
  scale_color_gradient2(midpoint = mid, low = "yellow", mid = "blue", high = "red",
                        space = "Lab",
                        guide = guide_legend(title = "predicted price range")) +
  ggtitle("New York City current listing property predicted price") +
  theme(plot.title = element_text(family = "Arial", face = "bold", size = 10))
ny_listing_price
ny_predict_price
From the above maps, the listing and predicted prices show a similar color trend (which encodes property price) in both cities, indicating good model performance. This similarity makes intuitive sense, since the distribution of property prices does not change dramatically over short periods. Additionally, although the color trends within each city are similar, the two New York maps differ noticeably in color while the two San Francisco maps are nearly identical, suggesting that housing prices have risen considerably in New York in recent years while those in San Francisco have remained roughly flat.